Voice Command Recognition System Based on MFCC and VQ Algorithms

Mahdi Shaneh and Azizollah Taheri

Abstract

The goal of this project is to design a system that recognizes voice commands. Most voice recognition systems contain two main modules: feature extraction and feature matching. In this project, the MFCC algorithm is used for the feature extraction module. Using this algorithm, the cepstral coefficients are calculated on the mel frequency scale. Vector quantization (VQ) is used to reduce the amount of data and so decrease computation time. In the feature matching stage, Euclidean distance is applied as the similarity criterion. Because the algorithms used are highly accurate, the accuracy of this voice command system is high. With at least 5 repetitions of each command in a single training session, and two repetitions in each testing session, a zero error rate in command recognition is achieved.

Keywords: MFCC, Vector quantization, Vocal tract, Voice command.

I. INTRODUCTION

Speech processing is one of the most important branches of digital signal processing. Speech signals can be used for speech recognition, speaker recognition or voice command recognition systems. In a motorized wheelchair, for example, a voice command recognition system can be used instead of the usual mechanical command system.

The proposed voice command recognition system includes two main stages. The first stage contains feature extraction and storage of the extracted features as training data. The second stage is the test. In this stage, the features of a newly entered command are extracted and compared with the stored features in order to recognize the command.

The MFCC algorithm is used for feature extraction, and the vector quantization algorithm is used to reduce the amount of extracted data into the form of codebooks. These data are saved as acoustic vectors. In the matching stage, the features of the input command are compared with each codebook using the Euclidean distance criterion.

This paper is organized as follows.
Section II details the proposed method, Section III contains the experimental results, and Section IV is the conclusion.

Authors are with Islamic Azad University, Najafabad branch, Iran (e-mails: mahdishaeh@yahoo.com, taheri_az@yahoo.com).

II. VOICE COMMAND RECOGNITION SYSTEM

In this section, the speech production mechanism, voiced and unvoiced sounds, and formants are described first. With these concepts in place, the main parts of the proposed recognition system, feature extraction and feature matching, are described.

A. Speech Production

The speech signal is an acoustic sound pressure wave that originates from air exiting the vocal tract and from voluntary movement of anatomical structures. Fig. 1 shows a schematic diagram of the human speech production mechanism. The components of this system are the lungs, trachea, larynx (the organ of voice production), pharyngeal cavity, oral cavity and nasal cavity [1].

Fig. 1 Schematic diagram of the human speech production mechanism

In technical discussion, the pharyngeal and oral cavities are usually called the "vocal tract". The vocal tract therefore begins at the output of the larynx and terminates at the input of the lips. Finer anatomical components critical to speech production include the vocal cords, soft palate or velum, tongue, teeth, and lips. These components can move to different positions to change the size and shape of the vocal tract and produce various speech sounds.

For engineering purposes, we can consider the speech production mechanism in terms of an acoustic filtering operation. Thus, instead of the anatomical model (Fig. 1), a technical model for speech production can be considered (Fig. 2). This filter is excited by the organs below it.

Fig. 2 Technical model for speech production

In speech processing, there are two fundamental excitation types: "voiced" and "unvoiced". Voiced sounds are produced by forcing air through the glottis so that the vocal cords vibrate. This vibration produces a quasi-periodic airflow through the vocal tract; by this means, the larynx provides a periodic excitation to the system. The sound produced in this way is called "voice" [1]. Unvoiced sounds are produced when the larynx is open and the vocal cords do not vibrate, so the air flowing through the vocal tract is not periodic. An unvoiced sound therefore has low amplitude and a noisy form.

During voiced sound production, we have a periodic signal and a vocal tract whose shape varies as a function of time. The vocal tract is a non-uniform acoustic tube. For a uniform tube, the resonance frequencies are obtained as follows:

$$F_i = (2i - 1)\frac{C}{4L}, \quad i = 1, 2, 3, \ldots \tag{1}$$

where $L$ is the length of the tube, $L = 17.5$ cm (almost equal to the length of an adult human vocal tract), and $C$ is the speed of sound. This gives a series of resonance frequencies for the tube (in this case 500 Hz, 1500 Hz, 2500 Hz, ...).

Fig. 3 Resonance frequencies for a uniform acoustic tube

The shape of the vocal tract, however, varies during speaking, so its resonance frequencies change. These resonance frequencies are called formants, and they characterize the shape of the vocal tract. For each voiced sound there are an infinite number of formants, but usually only the first few are used. For an unvoiced sound there is no resonance frequency, because there is no periodic (or quasi-periodic) excitation in the vocal tract. Fig. 4 shows the formants of "i" and "o" as examples of voiced-sound formants.

Fig. 4 Formants of the "i" and "o" vowels

The formants are useful for speech-dependent recognition systems.
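As a numerical check of the uniform-tube formula (1), the short sketch below evaluates the first three resonances for L = 17.5 cm. The paper does not state a value for the speed of sound; the value of 350 m/s used here is an assumption, chosen because it reproduces the quoted 500 Hz, 1500 Hz and 2500 Hz. The function name is hypothetical.

```python
def uniform_tube_resonances(length_m, speed_of_sound, count=3):
    """Equation (1): F_i = (2i - 1) * C / (4L) for i = 1, 2, ..., count."""
    return [(2 * i - 1) * speed_of_sound / (4.0 * length_m)
            for i in range(1, count + 1)]


# Assumption: C = 350 m/s (not given in the text); L = 17.5 cm as stated.
resonances = uniform_tube_resonances(length_m=0.175, speed_of_sound=350.0)
print([round(f) for f in resonances])
```

Under these assumptions the resonances are odd multiples of the first one, which is why the uniform tube gives evenly spaced peaks, while a real vocal tract with a changing shape shifts them.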
Using these formants, the vocal tract and the uttered vowels can be characterized. Because the commands in our system have different vowels, an input command can be recognized by comparing its characteristics with the characteristics stored in the database.

B. Feature Extraction

Before a command is trained or identified by the system, the voice signal must be processed to extract the important characteristics of speech. Pitch frequency and formants are the most important features of a voice signal. Pitch is the fundamental frequency of the speech signal and corresponds to the fundamental frequency of vocal cord vibration; it is a characteristic of the excitation source. Formants are resonance frequencies of the vocal tract, and so they are characteristics of the vocal tract. Fig. 5 shows a general linear discrete-time model for speech production [1].

According to this model, the speech signal $s(n)$ is the convolution of the excitation signal $e(n)$ with the vocal tract impulse response $\theta(n)$.

Fig. 5 A general discrete-time model for speech production

We have access only to the output signal $s(n)$, but recognizing the command requires $e(n)$ and $\theta(n)$ separately. Because the individual parts are not combined linearly, cepstral analysis is used to separate $e(n)$ and $\theta(n)$. Feature extraction therefore requires calculating the cepstral coefficients on the mel frequency scale.

C. Cepstral Analysis

Cepstral analysis is a time-domain analysis whose main idea is the separation of two convolved signals [1]. The output signal of the speech production system, $s(n)$, is:

$$s(n) = e(n) * \theta(n) \tag{2}$$

Using the Fourier transform, the convolution becomes a product:

$$S(\omega) = E(\omega)\,\Theta(\omega) \tag{3}$$

Taking the logarithm turns the product into a sum:

$$\log S(\omega) = \log E(\omega) + \log \Theta(\omega) \tag{4}$$

Denoting each logarithm by $C(\omega)$, this equation is written as:

$$C_s(\omega) = C_e(\omega) + C_\theta(\omega) \tag{5}$$

Applying the IDFT gives the cepstral coefficients:

$$c_s(n) = c_e(n) + c_\theta(n) \tag{6}$$

In other words, the cepstral coefficients are computed as:

$$c_s(n) = \mathcal{F}^{-1}\{\log[\mathcal{F}\{s(n)\}]\} \tag{7}$$

D. Mel-frequency Scaling

Psychophysical studies have shown that the human auditory system does not follow a linear frequency scale. Thus, for each tone with an actual frequency $f$, measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The main advantage of mel frequency scaling is that it closely approximates the frequency response of the human auditory system and can be used to capture the phonetically important characteristics of speech. One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale (see Fig. 6). That filter bank has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel frequency interval.

Fig. 6 Mel-spaced filter bank

The relation between linear frequency and mel frequency is as follows:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{8}$$

E. MFCC Computation

A block diagram of the structure of an MFCC processor is shown in Fig. 7. The main purpose of the MFCC processor is to mimic the behavior of the human ear. In the first step, the continuous speech signal is blocked into frames of $N$ samples, with adjacent frames separated by $M$ samples ($M < N$). Typical values are $N = 256$ and $M = 100$. The next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of the frame. Typically the Hamming window is used.

Fig. 7 MFCC calculation

The next processing step is the Fast Fourier Transform, which converts each frame of $N$ samples from the time domain into the frequency domain. After that, the frequency scale is converted from linear to mel, and the logarithm of the result is taken. In the final step, the log mel spectrum is converted back to the time domain. The result is called the mel frequency cepstrum coefficients (MFCC). This cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal. Using the triangular filter bank gives a significant decrease in the amount of data, but a further reduction is needed to simplify the later computations. For this purpose, the vector quantization algorithm is used [5].
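The MFCC steps described above (framing, Hamming windowing, FFT, mel filter bank, logarithm, and conversion back to the time domain) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say which transform performs the final step, so a DCT-II is assumed here as a common choice, and the filter count (20), coefficient count (12), the 8 kHz sample rate in the usage lines, and all function names are illustrative.

```python
import numpy as np


def hz_to_mel(f):
    """Equation (8): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of equation (8)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def frame_signal(signal, frame_len=256, hop=100):
    """Block the signal into frames of N samples whose starts are M samples apart."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres spaced uniformly on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):       # rising edge of the triangle
            fbank[i, k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling edge of the triangle
            fbank[i, k] = (right - k) / (right - centre)
    return fbank


def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II along the last axis, keeping the first n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n)[None, :] + 0.5) / n)
    return x @ basis.T


def mfcc(signal, sample_rate, frame_len=256, hop=100, n_filters=20, n_coeffs=12):
    """Return one row of cepstral coefficients per frame."""
    frames = frame_signal(np.asarray(signal, dtype=float), frame_len, hop)
    frames = frames * np.hamming(frame_len)                      # windowing
    power = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)                         # avoid log(0)
    return dct_ii(log_mel, n_coeffs)                             # assumed DCT-II


# Usage: one second of a 440 Hz tone at an assumed 8 kHz sample rate.
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
features = mfcc(tone, sr)
print(features.shape)  # (78, 12): 78 frames of 12 coefficients
```

Each row of `features` is one acoustic vector; these rows are the input that the vector quantization stage reduces to a codebook.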

F. Vector Quantization

Vector quantization (VQ) is used for command identification in our system. VQ is the process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and is represented by its center (called a centroid). The collection of all the centroids makes up a codebook. The amount of data is significantly reduced, since the number of centroids is at least ten times smaller than the number of vectors in the original sample. This reduces the amount of computation needed for comparisons in later stages [2], [4]. Even though the codebook is smaller than the original sample, it still accurately represents the command characteristics. The only difference is that there will be some spectral distortion.

G. Codebook Generation

There are many different algorithms for creating a codebook. Since command recognition depends on the generated codebooks, it is important to select an algorithm that best represents the original sample. For our system, the LBG algorithm (also known as the binary split algorithm) is used. The algorithm is implemented by the following recursive procedure [2], [5], [6]:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Splitting: double the size of the codebook by splitting each current codeword $y_n$ into two according to the rule
$$y_n^{+} = y_n(1 + \varepsilon), \qquad y_n^{-} = y_n(1 - \varepsilon) \tag{9}$$
3. Nearest-Neighbor Search: for each training vector, find the centroid in the current codebook that is closest (in terms of the similarity measure), and assign that vector to the corresponding cell (associated with the closest centroid). This is done using the K-means iterative algorithm.
4. Centroid Update: update the centroid in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3, and 4 until a codebook of size M is reached.

H. Command Matching

In the recognition phase, the features of the unknown command are extracted and represented by a sequence of feature vectors $\{x_1, \ldots, x_n\}$.
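The LBG procedure of Section G, together with matching an unknown command against one codebook per command by Euclidean distance, can be sketched as below. This is an illustrative sketch rather than the authors' code. Assumptions: the target codebook size is a power of two, the inner loop stops when the relative improvement in average distance drops below a tolerance (one common reading of the "preset threshold" in step 5), the training data are synthetic Gaussian clusters standing in for MFCC vectors, and all function names and the command labels are hypothetical.

```python
import numpy as np


def nearest_codeword(vectors, codebook):
    """For each vector, return the index of and Euclidean distance to its closest codeword."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    idx = np.argmin(dists, axis=1)
    return idx, dists[np.arange(len(vectors)), idx]


def lbg_codebook(vectors, size, eps=0.001, tol=1e-4, max_iter=100):
    """LBG (binary split) training; `size` is assumed to be a power of two."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)       # step 1: global centroid
    while len(codebook) < size:                          # step 6: grow to `size`
        # Step 2: split every codeword y into y(1 + eps) and y(1 - eps).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        previous = np.inf
        for _ in range(max_iter):
            idx, dist = nearest_codeword(vectors, codebook)   # step 3
            for k in range(len(codebook)):                    # step 4
                members = vectors[idx == k]
                if len(members):                 # leave empty cells unchanged
                    codebook[k] = members.mean(axis=0)
            distortion = dist.mean()
            # Step 5 (assumed stopping rule): stop once improvement is negligible.
            if previous - distortion <= tol * distortion:
                break
            previous = distortion
    return codebook


def recognize(features, codebooks):
    """Choose the command whose codebook gives the lowest average distance."""
    features = np.asarray(features, dtype=float)
    scores = {name: nearest_codeword(features, cb)[1].mean()
              for name, cb in codebooks.items()}
    return min(scores, key=scores.get)


# Usage with synthetic 12-dimensional vectors (stand-ins for MFCC frames).
rng = np.random.default_rng(0)
train_start = rng.normal(loc=2.0, scale=0.3, size=(200, 12))
train_stop = rng.normal(loc=-2.0, scale=0.3, size=(200, 12))
codebooks = {"start": lbg_codebook(train_start, 8),
             "stop": lbg_codebook(train_stop, 8)}
unknown = rng.normal(loc=2.0, scale=0.3, size=(50, 12))
print(recognize(unknown, codebooks))  # start
```

Averaging the per-vector minimum distances gives a single score per codebook, so the decision is one comparison per command. Note that the multiplicative split leaves a codeword at the origin unchanged; an additive perturbation would be one way around that corner case.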
Each feature vector in the sequence X is compared with all the stored codewords in a codebook, and the codeword with the minimum distance from the feature vector is selected. For each codebook a distance measure is computed, and the command whose codebook gives the lowest distance is chosen as the proposed command. One way to define the distance measure is to use the Euclidean distance:

$$D = \left( \sum_{j} (x_j - y_j)^2 \right)^{1/2} \tag{10}$$

where $x_j$ and $y_j$ are the $j$-th components of the input vector $x$ and the codeword $y$. Fig. 9 shows the schematic of the nearest-neighbor search [4].

Fig. 9 A schematic of the nearest-neighbor search in the VQ decoding process

As we see, the search for the nearest vector is done exhaustively, by finding the distance between the input vector X and each of the codewords C1-CM from the codebook C. The one with the smallest distance is coded as the output command.

Fig. 8 The process of VQ codebook generation; the features are shown by blue dots, the group boundaries in green and the centroids in red

In the splitting rule (9) of the LBG procedure, which doubles the size of the codebook by splitting each current codeword $y_n$, $n$ varies from 1 to the current size of the codebook and $\varepsilon$ is the splitting parameter. For our system, $\varepsilon = 0.001$.

III. EXPERIMENTAL RESULTS

To implement the proposed voice command recognition system, a system with 20 voice commands was considered. Some of the commands are: start, stop, up, down, forward, backward, increase, decrease, left, right, fast and slow. Training was done in two forms. In the first, the system was trained with one repetition of each command, followed by one repetition in each testing session. With this type of training, the error rate is about 15%. In the second form, the speaker repeated the words 5 times in a single training session, and then twice in each testing session. By doing this, a zero error rate in command recognition was achieved.

IV. CONCLUSION

Because the shape of the human vocal tract changes during the production of different words, the resonance frequencies of the vocal tract (the formants) also change. Using this phenomenon, we can extract the voice features of each command and implement a voice command recognition system. If the voice commands chosen for the training phase differ more in their vowels, the recognition system will be more accurate. The accuracy of the system also increases if the number of repetitions of each command in the training stage is increased.

REFERENCES

[1] Deller J.R., Hansen J.H.L. and Proakis J.G. (1993), Discrete-Time Processing of Speech Signals, New York, Macmillan Publishing Company.
[2] Rabiner, L. R. and Juang, B.-H. (1993), Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ.
[3] Winther Jørgensen and Lasse Lohilahti Mølgaard, IMM-THESIS-2006, Tools for Automatic Audio Indexing.
[4] Christian Spaer, 2005, Speech codec identification for Error Correction of Across-Channel effects in speech coded environments.
[5] B. Richard Wildermoth, January 2001, "Text-independent speaker recognition using source based features", Master of Philosophy, Griffith University, Australia.
[6] Tejaswini Hebalkar, Spring 2000, Voice Recognition and Identification System, Final Report, 18-551 Digital Communications and Signal Processing Systems Design.
[7] Nilsson Magnus, October 2001, Speaker Verification in JAVA, a thesis submitted in partial fulfillment of the requirements for the degree of Master of Computer and Information Engineering, School of Microelectronic Engineering, Griffith University.