Voice Command Recognition System Based on MFCC and VQ Algorithms

Mahdi Shaneh and Azizollah Taheri

Abstract: The goal of this project is to design a system that recognizes voice commands. Most voice recognition systems contain two main modules: feature extraction and feature matching. In this project, the MFCC algorithm is used to implement the feature extraction module. Using this algorithm, the cepstral coefficients are calculated on the mel frequency scale. The VQ (vector quantization) method is used to reduce the amount of data and so decrease computation time. In the feature matching stage, Euclidean distance is applied as the similarity criterion. Because of the high accuracy of the algorithms used, the accuracy of this voice command system is high. Using these algorithms, with at least 5 repetitions of each command in a single training session, and then two repetitions in each testing session, a zero error rate in the recognition of commands is achieved.

Keywords: MFCC, Vector quantization, Vocal tract, Voice command.

I. INTRODUCTION

Speech processing is one of the most important branches of digital signal processing. Speech signals can be used for speech recognition, speaker recognition or voice command recognition systems. For example, in a motorized wheelchair, a voice command recognition system can be used instead of the usual mechanical command system.

The proposed voice command recognition system includes two main stages. The first stage contains feature extraction and the storage of the extracted features as training data. The second stage is the test. In this stage, the features of a newly entered command are extracted and compared with the stored features in order to recognize the command. The MFCC algorithm is used for feature extraction, and the vector quantization algorithm is used to reduce the amount of extracted data to the form of codebooks. These data are saved as acoustic vectors. In the matching stage, the features of the input command are compared with each codebook using the Euclidean distance criterion.

This paper is organized as follows.
In Section II the proposed method is detailed, Section III contains the experimental results, and Section IV is the conclusion.

Authors are with Islamic Azad University, Najafabad branch, Iran (e-mails: mahdishaeh@yahoo.com, taheri_az@yahoo.com).

II. VOICE COMMAND RECOGNITION SYSTEM

In this section, the speech production mechanism, voiced and unvoiced sounds, and formants are described first. Once these concepts have been introduced, the main parts of the proposed recognition system, feature extraction and feature matching, are described.

A. Speech Production

The speech signal is an acoustic sound pressure wave that originates from the exiting of air from the vocal tract and the voluntary movement of anatomical structures. Fig. 1 shows a schematic diagram of the human speech production mechanism. The components of this system are the lungs, trachea, larynx (organ of voice production), pharyngeal cavity, oral cavity and nasal cavity [1].

Fig. 1 Schematic diagram of the human speech production mechanism

In technical discussion, the pharyngeal and oral cavities are usually called the "vocal tract". The vocal tract therefore begins at the output of the larynx and terminates at the input of the lips. Finer anatomical components critical to speech production include the vocal cords, soft palate or velum, tongue, teeth, and lips. These components can move to different positions to change the size and shape of the vocal tract and produce various speech sounds. For engineering purposes, we can consider the speech production mechanism in terms of an acoustic filtering operation. Thus, instead of the anatomical model (Fig. 1), a technical model for speech production can be considered (Fig. 2). This filter is excited by the organs below it.

Fig. 2 Technical model for speech production

In speech processing, there are two fundamental excitation types: "voiced" and "unvoiced". Voiced sounds are produced by forcing air through the glottis so that the vocal cords vibrate. This vibration of the vocal cords produces a quasi-periodic airflow through the vocal tract. By this means, the larynx provides a periodic excitation to the system. The sound produced in this way is called "voiced" [1]. Unvoiced sounds are produced when the larynx is open and there is no vibration of the vocal cords, so the air flowing through the vocal tract is not periodic. An unvoiced sound therefore has low amplitude and a noisy form.

During voiced sound production, then, we have a periodic signal and a vocal tract whose shape varies as a function of time. The vocal tract is a non-uniform acoustic tube. For a uniform tube, the resonance frequencies are obtained as follows:

$$F_i = \frac{(2i-1)\,C}{4L}, \qquad i = 1, 2, 3, \ldots \tag{1}$$

where the length of the tube is L = 17.5 cm (almost equal to the length of an adult human vocal tract) and C is the speed of sound. This tube therefore has distinct resonance frequencies (in this case 500 Hz, 1500 Hz, 2500 Hz, ...).

Fig. 3 Resonance frequencies for a uniform acoustic tube

For the vocal tract, however, the shape of the tube varies during speaking, so the resonance frequencies change. These resonance frequencies are called formants. We can characterize the shape of the vocal tract by these formants. For each voiced sound there is an infinite number of formants, but usually only the first few are used. For an unvoiced sound there is no resonance frequency, because there is no periodic (or quasi-periodic) excitation in the vocal tract. Fig. 4 shows the formants of "i" and "o" as examples of the formants of voiced sounds.

Fig. 4 Formants of "i" and "o" vowels

The formants are useful for speech-dependent recognition systems.
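As a quick numerical check of Eq. (1), the short sketch below evaluates the first three resonances of a uniform tube. The function name `tube_resonances` is hypothetical, and the speed of sound C = 350 m/s is an assumption: the text does not give a value, and this one is chosen only because it reproduces the quoted 500 Hz for L = 17.5 cm.

```python
def tube_resonances(length_m, speed_of_sound, count):
    """Evaluate F_i = (2i - 1) * C / (4L) for i = 1..count (Eq. (1))."""
    return [(2 * i - 1) * speed_of_sound / (4.0 * length_m)
            for i in range(1, count + 1)]


# Assumed value: C = 350 m/s is not stated in the text.
# L = 17.5 cm is the tube length given in the text.
formants = tube_resonances(length_m=0.175, speed_of_sound=350.0, count=3)
print([round(f) for f in formants])
```

The values come out at approximately 500, 1500 and 2500 Hz, so the resonances of a uniform tube are odd multiples of C/(4L); a real vocal tract departs from this pattern because its cross-section is not uniform.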
Using these formants, the vocal tract and also the vowels of an utterance can be characterized. Because the commands in our system have different vowels, an input command can be recognized by comparing its characteristics with the characteristics stored in the database.

B. Feature Extraction

Before a command is trained or identified by the system, the voice signal must be processed to extract the important characteristics of speech. Pitch frequency and formants are the most important features of a voice signal. Pitch is the fundamental frequency of the speech signal; it corresponds to the fundamental frequency of the vocal cord vibrations and is therefore a characteristic of the excitation source. Formants are the resonance frequencies of the vocal tract and so are characteristics of the vocal tract. Fig. 5 shows a general linear discrete-time model for speech production [1].

According to this model, the speech signal, $s(n)$, is composed of a convolved combination of the excitation signal with the vocal tract impulse response.

Fig. 5 A general discrete-time model for speech production

We have access only to the output signal, $s(n)$, but we need $e(n)$ and $\theta(n)$ separately in order to recognize the command. Because the individual parts are not combined linearly, cepstral analysis is used to separate $e(n)$ and $\theta(n)$. For feature extraction, the cepstral coefficients must be calculated on the mel frequency scale.

C. Cepstral Analysis

Cepstral analysis is a time-domain analysis whose main idea is the separation of two convolved signals [1]. The output signal of the speech production system, $s(n)$, is as follows:

$$s(n) = e(n) * \theta(n) \tag{2}$$

Using the Fourier transform we have:

$$S(\omega) = E(\omega)\,\Theta(\omega) \tag{3}$$

Taking the logarithm, the following equation is obtained:

$$\log S(\omega) = \log E(\omega) + \log \Theta(\omega) \tag{4}$$

This equation is written as follows:

$$C_s(\omega) = C_e(\omega) + C_\theta(\omega) \tag{5}$$

Using the IDFT, the cepstral coefficients are obtained:

$$c_s(n) = c_e(n) + c_\theta(n) \tag{6}$$

In other words, the cepstral coefficients are computed in the form:

$$c_s(n) = \mathcal{F}^{-1}\left(\log\left[\mathcal{F}(s(n))\right]\right) \tag{7}$$

D. Mel-frequency Scaling

Physiological studies have shown that the human auditory system does not follow a linear scale. Thus, for each tone with an actual frequency, f, measured in Hz, a subjective pitch is mapped onto a scale called the mel scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The main advantage of mel frequency scaling is that it closely approximates the frequency response of the human auditory system and can be used to capture the phonetically important characteristics of speech. One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale (see Fig. 6). That filter bank has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel frequency interval.

Fig. 6 Mel-spaced filter bank

The relation between linear frequency and mel frequency is as follows:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{8}$$

E. MFCC Computation

A block diagram of the structure of an MFCC processor is shown in Fig. 7. The main purpose of the MFCC processor is to mimic the behavior of the human ear. In the first step, the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M samples (M < N). Typical values for N and M are N = 256 and M = 100. The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. Typically the Hamming window is used.

Fig. 7 MFCC calculation

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. After that, the frequency scale is converted from linear to mel scale, and the logarithm of the result is taken. In the final step, the log mel spectrum is converted back to the time domain. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal.

Using the triangular filter bank, we obtain a significant decrease in the amount of data. To simplify the subsequent computations, however, a further decrease in the amount of data is needed. For this purpose the vector quantization algorithm is used [5].
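The processing chain of Section E (framing, Hamming window, Fourier transform, mel filter bank, logarithm, transform back to the time domain) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' implementation: the 8 kHz sampling rate, the 20 filters, the 12 coefficients, the use of a DCT for the final step, and the decision to drop the zeroth coefficient are all assumptions, and every function name is hypothetical. A naive DFT stands in for the FFT to keep the sketch self-contained.

```python
import cmath
import math


def hz_to_mel(f):
    """Eq. (8): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of Eq. (8)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def frame_signal(signal, n=256, m=100):
    """Frames of n samples; successive frames start m samples apart."""
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, m)]


def hamming(n):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def power_spectrum(frame):
    """Naive O(N^2) DFT power spectrum for bins 0..N/2 (an FFT gives the same values)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with edges spaced uniformly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # peak and falling edge
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def dct(values, num_coeffs):
    """DCT-II, assumed here as the transform back to the time domain."""
    n = len(values)
    return [sum(v * math.cos(math.pi * q * (i + 0.5) / n) for i, v in enumerate(values))
            for q in range(num_coeffs)]


def mfcc(signal, sample_rate=8000, n=256, m=100, num_filters=20, num_coeffs=12):
    window = hamming(n)
    bank = mel_filterbank(num_filters, n, sample_rate)
    features = []
    for frame in frame_signal(signal, n, m):
        spec = power_spectrum([x * w for x, w in zip(frame, window)])
        energies = [sum(h * p for h, p in zip(row, spec)) for row in bank]
        log_e = [math.log(max(e, 1e-10)) for e in energies]  # floor avoids log(0)
        features.append(dct(log_e, num_coeffs + 1)[1:])      # assumption: drop c0
    return features


# Example: 0.1 s of a 440 Hz tone sampled at the assumed 8 kHz.
tone = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(800)]
feats = mfcc(tone)
print(len(feats), len(feats[0]))
```

With N = 256 and M = 100, the 800-sample example yields 6 frames of 12 coefficients each, which is the kind of acoustic-vector sequence that the vector quantization stage then compresses.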

F. Vector Quantization

Vector quantization (VQ) is used for command identification in our system. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and is represented by its center (called a centroid). The collection of all the centroids makes up a codebook. The amount of data is significantly reduced, since the number of centroids is at least ten times smaller than the number of vectors in the original sample. This reduces the amount of computation needed for comparison in later stages [2], [4]. Even though the codebook is smaller than the original sample, it still accurately represents the command characteristics. The only difference is that there will be some spectral distortion.

G. Codebook Generation

There are many different algorithms for creating a codebook. Since command recognition depends on the generated codebooks, it is important to select an algorithm that best represents the original sample. For our system, the LBG algorithm (also known as the binary split algorithm) is used. The algorithm is implemented by the following recursive procedure [2], [5], [6]:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Splitting: split each current codeword $y_n$ into two codewords according to the rule
$$y_n^{+} = y_n(1+\varepsilon), \qquad y_n^{-} = y_n(1-\varepsilon) \tag{9}$$
3. Nearest-Neighbor Search: for each training vector, find the centroid in the current codebook that is closest (in terms of the similarity measurement), and assign that vector to the corresponding cell (associated with the closest centroid). This is done using the K-means iterative algorithm.
4. Centroid Update: update the centroid in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3, and 4 until a codebook of size M is reached.

H. Command Matching

In the recognition phase, the features of the unknown command are extracted and represented by a sequence of feature vectors $\{x_1, \ldots, x_n\}$.
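A minimal sketch of the codebook generation steps of Section G, together with a nearest-codebook decision based on Euclidean distance, is given here. It is an illustration under assumptions rather than the authors' code: all function names are hypothetical, the codebook size is assumed to be a power of two, and the stopping rule for step 5 is read as "the relative improvement in average distance falls below a threshold `tol`", which is one common interpretation. The toy two-dimensional training data are invented purely to exercise the code.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def nearest(vector, codebook):
    """Index of the codeword closest to `vector` (exhaustive search)."""
    return min(range(len(codebook)), key=lambda i: distance(vector, codebook[i]))


def avg_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    return sum(distance(v, codebook[nearest(v, codebook)]) for v in vectors) / len(vectors)


def lbg(vectors, size, eps=0.001, tol=1e-3):
    """Binary-split LBG; `size` is assumed to be a power of two."""
    codebook = [centroid(vectors)]                        # step 1
    while len(codebook) < size:                           # step 6
        codebook = [[c * (1 + s * eps) for c in y]        # step 2: y(1 + eps), y(1 - eps)
                    for y in codebook for s in (+1, -1)]
        prev = float("inf")
        while True:                                       # step 5
            cells = [[] for _ in codebook]
            for v in vectors:                             # step 3
                cells[nearest(v, codebook)].append(v)
            codebook = [centroid(c) if c else y           # step 4 (empty cell keeps its codeword)
                        for c, y in zip(cells, codebook)]
            dist = avg_distortion(vectors, codebook)
            if prev - dist <= tol * max(dist, 1e-12):     # assumed stopping rule
                break
            prev = dist
    return codebook


def recognize(features, codebooks):
    """Name of the command whose codebook gives the lowest average distance."""
    return min(codebooks, key=lambda name: avg_distortion(features, codebooks[name]))


# Invented toy data: two "commands", each with two clusters of 2-D feature vectors.
up = [[1.0 + 0.01 * i, 2.0 - 0.01 * i] for i in range(10)] + \
     [[5.0 + 0.01 * i, 6.0] for i in range(10)]
down = [[-3.0 - 0.01 * i, 4.0 + 0.01 * i] for i in range(10)] + \
       [[8.0 + 0.01 * i, -1.0] for i in range(10)]
codebooks = {"up": lbg(up, 2), "down": lbg(down, 2)}
result = recognize([[1.02, 1.97], [5.03, 6.0]], codebooks)
print(result)
```

In this toy run the test vectors lie next to the two clusters of the first command, so its codebook gives the smaller average distance. Averaging the per-vector minimum distances is one reasonable way to turn the per-codeword comparison into a single score per codebook.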
Each feature vector in the sequence X is compared with all the stored codewords in a codebook, and the codeword with the minimum distance from the feature vector is selected. For each codebook a distance measure is computed, and the command with the lowest distance is chosen as the proposed command. One way to define the distance measure is to use the Euclidean distance:

$$D = \left(\sum_{j}\left(x_j - y_{ij}\right)^2\right)^{1/2} \tag{10}$$

where $x_j$ is the j-th component of the input vector and $y_{ij}$ is the j-th component of codeword $y_i$. Fig. 9 describes the schematic of the nearest-neighbor search [4].

Fig. 9 A schematic of the nearest-neighbor search in the VQ decoding process

As we see, the search for the nearest vector is done exhaustively, by finding the distance between the input vector X and each of the codewords C1-CM from the codebook C. The one with the smallest distance is coded as the output command.

Fig. 8 The process of VQ codebook generation; the features are shown by blue dots, the group boundary in green and the centroids in red

The splitting in step 2 of the codebook generation procedure doubles the size of the codebook: n varies from 1 to the current size of the codebook, and ε is the splitting parameter. For our system, ε = 0.001.

III. EXPERIMENTAL RESULTS

To implement the proposed voice command recognition system, a system with 20 voice commands was considered. Some of the commands are as follows: start, stop, up, down, forward, backward, increase, decrease, left, right, fast and slow.

The training phase was carried out in two forms. In the first form, the system was trained with one repetition of each command, and each command was spoken once in each testing session. With this type of training the error rate is about 15%. In the second form, the speaker repeated the words 5 times in a single training session, and then twice in each testing session. With this approach a zero error rate in the recognition of commands was achieved.

IV. CONCLUSION

As a result of changes in the shape of the human vocal tract during the generation of different words, the resonance frequencies of the vocal tract (the formants) also change. Using this phenomenon, we can extract the voice features of each command and implement a voice command recognition system. If the voice commands used in the training phase contain more vowel differences between them, the recognition system will be more accurate. The accuracy of the system also increases if we increase the number of repetitions of each command in the training stage.

REFERENCES

[1] Deller J.R., Hansen J.H.L. & Proakis J.G. (1993), Discrete-Time Processing of Speech Signals, New York, Macmillan Publishing Company.
[2] Rabiner, L. R. and Juang, B.-H. (1993), Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ.
[3] Winther Jørgensen and Lasse Lohilahti Mølgaard, IMM-THESIS-2006, Tools for Automatic Audio Indexing.
[4] Christian Spanner, 2005, Speech codec identification for Error Correction of Across-Channel effects in speech coded environments.
[5] B. Richard Wildermoth, January 2001, "Text-independent speaker recognition using source based features", Master of Philosophy, Griffith University, Australia.
[6] Tejaswini Hebalkar, Spring 2000, Voice Recognition and Identification System, Final Report, 18-551 Digital Communications and Signal Processing Systems Design.
[7] Nilsson Magnus, October 2001, Speaker Verification in JAVA, a thesis submitted in partial fulfillment of the requirements for the degree of Master of Computer and Information Engineering, School of Microelectronic Engineering, Griffith University.