Speech Recognition on Robot Controller


Speech Recognition on Robot Controller Implemented on FPGA

Phan Dinh Duy, Vu Duc Lung, Nguyen Quang Duy Trang, and Nguyen Cong Toan
University of Information Technology, National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
Email: {lungvd, duypd}@uit.edu.vn

Abstract: This paper presents a speech recognition system for robot control built on the DE2 development kit, which is used at the Computer Engineering Department of the University of Information Technology. The hardware consists of an Altera DE2 development kit and a Philips SHM1000 microphone. The system comprises a hardware design in Verilog HDL and software in embedded C for the Nios II processor. The core of the system is the hardware on the FPGA, which includes four main components: a module for receiving and converting the audio signal, a memory controller, an FFT module, and a Nios II processor. The system has two modes, training and recognition, based on the vector quantization approach.

Index Terms: speech recognition, fast Fourier transform, Mel-frequency filter, audio cepstrum, vector quantization, DE2, Nios, Verilog, robot control, Vietnamese

I. INTRODUCTION

Speech recognition for human-robot interaction has been investigated and developed by many organizations around the world. Some notable achievements are Speech Recognition (Microsoft), HTK (Machine Intelligence Laboratory), and Sphinx (CMU). Most of the solutions listed above run on high-speed computers with large resource requirements. They cannot be integrated into special-purpose systems that require low power and few resources, such as the control of robots, machines, or household devices [1][2]. There are some studies and experiments on speech recognition on FPGA, such as "The Speech Recognition Chip Implementation on FPGA" [3] and "An FPGA Implementation of Speech Recognition with Weighted Finite State Transducers" [4].
However, in general, those studies concentrate on recognition only; they have not been applied to interaction with robots, and they were not designed to work with Vietnamese. Inspired by that idea, we decided to build a speech recognition system on FPGA for robot control, using the DE2 development kit [7] (available in the Computer Engineering Department's laboratory at the University of Information Technology) for the study and research purposes of the department. The system is based on a hardware design in Verilog HDL using the Quartus design tool, together with some pre-built modules (from Terasic), such as the FFT module and the SDRAM controller.

Manuscript received December 20, 2012; revised February 12, 2013.

II. SYSTEM SCHEMA

This paper proposes a system schema comprising hardware and software. The hardware includes the modules shown in Fig. 1: ADC, Memory Controller, FFT Controller, and Nios II.

Figure 1. Hardware of the system (ADC, Memory Controller, FFT Controller, and Nios II on the DE2 board).

Dataflow: The speech input (an analog signal) from the microphone enters the system after being converted to a digital signal by the ADC module. The digital signal (in the time domain) is converted to the frequency domain by the FFT module. The frequency-domain data is then passed to the Nios II processor for processing. The recognition process is carried out by a C program based on the vector quantization approach. The output is the command signal (to the robot controller) corresponding to the input speech.

A C program embedded in the Nios II processor handles the processing, carries out the training/recognition process, and displays the input/output data. Speech data is converted from the time domain to the frequency domain by the FFT module, and the speech characteristics are then extracted.
In training mode, the speech characteristic data is grouped using k-means clustering; this forms the codebook, the data pattern against which input speech is compared. The recognition process compares the extracted characteristic vectors of the input speech with the vectors in the codebook (trained beforehand in training mode, through the same data flow). The result is displayed on the LCD and indicator LEDs, and the output control signal is transmitted to the GPIO pins of the DE2 board.

2013 Engineering and Technology Publishing. doi: 10.12720/joace.1.3.274-278

The robot receives the control signal from these pins and performs the corresponding (pre-programmed) action.

III. SYSTEM DESIGN

A. Hardware

The hardware on the DE2 board handles receiving the speech, storing it in memory, segmenting it, and performing the fast Fourier transform.

ADC: This is the starting point of the system. The module receives the analog signal from the microphone, samples it, and returns a digital signal with a configurable sample rate and resolution. The digital data is stored in RAM, waiting to be processed.

Memory Controller: This module stores the data (using RAM on-chip memory) once the speech signal has been converted to digital form. The storage capacity can be changed depending on the control signal.

FFT Controller: This module fetches data from memory, segments it into frames (with a 1/3 overlap ratio), and controls the time-to-frequency conversion performed by the FFT module (from the MegaCore Function Library).

* Data segmenting (Fig. 2): Data from memory is segmented into overlapping frames. Each frame holds N samples, and the distance between the starts of two consecutive frames is M samples, with M = (1/3) N. Example: if each frame has 300 samples, the second frame begins at sample number 100 (M = 100).

Figure 2. Data segmenting.

* FFT controller state machine (Fig. 3):

RESET: default state (initialization, reset). Values are initialized in this state.
INWAIT: wait for the FFT istart start-conversion signal (from the software).
INSOP (Start of Packet): activate the control signals and sample the first data block received from memory.
INMID: sample the remaining data received from memory while the number of sampled values (incount) is less than the required count (LEN-2, where LEN is the FFT size, 256 in this system).
INEOP (End of Packet): get the last sample, activate the stop-receiving-data signal, and begin the FFT process.

Figure 3. FFT Controller states.

NIOS II: The processor handles the analysis process, including the training and recognition modes, after the data has been transformed by the FFT. The Nios II components are described in Fig. 4.

Figure 4. NIOS II components.

Nios II Processor: type Nios II/f, 4 KB data cache, level 2, with integrated floating point.
SDRAM Controller: communicates with the 8 MB SDRAM, the main memory of the system.
Peripheral buses: ikeys, iswitchs, oledr, oledg, LCD, SEG7.
ifftcoeff, FFT_exponent: FFT module data buses.
System control signals: ostart, iffcomplete.

The Nios II processor operates at 100 MHz, and the SDRAM at 100 MHz with a -3 ns phase shift, to keep the system synchronized. The training and recognition process is described in the next section.

B. Software

We use a Nios II Application because of its ease of use. The analysis process is programmed in C and embedded in the Nios II processor to control the system's operation.

The vector quantization approach is used in both the training and recognition modes.

Training mode: Each word is spoken several times; the system analyzes, collects, and classifies the data into a codebook (a vector collection smaller than the initial collection). The process is described in Fig. 5.

Figure 5. The training process.

Speech detection: First, the system continuously checks whether the input signal is speech by comparing the output exponent of the FFT Controller module (stored in FFT_exponent) with a threshold parameter. Higher-frequency speech content produces a higher exponent (because of the shifting process used to keep the data within the bus width). After some experiments, we found that a human voice is present when FFT_exponent is less than 61.

MFCC filter [5][6]: This extracts the speech characteristic data according to the listening characteristics of the human ear (the Mel frequency scale). We designed the Mel filter banks shown in Fig. 6. The filter expression is

    H_m(k) = 0                                          for f(k) < f_c(m-1)
             (f(k) - f_c(m-1)) / (f_c(m) - f_c(m-1))    for f_c(m-1) <= f(k) < f_c(m)
             (f(k) - f_c(m+1)) / (f_c(m) - f_c(m+1))    for f_c(m) <= f(k) < f_c(m+1)
             0                                          for f(k) >= f_c(m+1)            (1)

where
    f_c(m): center frequency of the m-th filter
    Fs: speech input sample rate
    f(k) = k*Fs/n: frequency of the k-th of n samples

Figure 6. Mel filter banks in the frequency domain.

The center frequencies f_c(m) are spaced linearly on the Mel scale, that is, logarithmically on the normal frequency scale. The conversion to the Mel scale is

    f_mel = 2595 * log10(1 + f/700)        (2)

Computing the output energy of each filter: this energy is the log of the sum of the products of the signal's frequency amplitudes and the corresponding filter weights:

    e(m) = log( sum_{j=1..N} h_m(j) * X(j) )        (3)

where
    h_m(j): signal's frequency amplitude
    X(j): corresponding weight

Discrete cosine transform (DCT): this yields the Mel-frequency cepstral coefficients:

    c(n) = dct(S(m)) = sum_{m=0..M-1} S(m) * cos( pi*n*(m + 1/2) / M ),   0 <= n < M        (4)

where
    c[n]: MFCC characteristic vector
    S[m]: output energy of the m-th filter
    M: number of filters
    N: number of characteristics to be extracted

K-means clustering: this reduces the training vector collection to form the training codebook. It is based on the Euclidean distance between two vectors:

    d(x, y) = sqrt( sum_{k=1..P} (x_k - y_k)^2 )        (5)

where
    x, y: the vectors being compared
    P: vector size (P = 12 in this system)

Operation steps:
1. Initialization: set the codebook size by randomly choosing N vectors for the codebook collection (each codeword is a vector). Each codeword is the center of a vector cluster.
2. Find the nearest codeword: for each vector in the training collection, compute the Euclidean distance to every codeword, take the nearest one, and label the vector as belonging to that codeword's cell.
3. Update the centers: for each cell, update the codeword to be the center of all vectors belonging to that cell.
4. Repeat steps 2-3 until no vector changes cells.

Recognition mode: The initial steps are the same as in training mode. After detecting the speech and extracting the MFCC characteristics of the received speech, the characteristic vectors are compared with each codebook in

the training collection to find the matching word. The recognition process is described in Fig. 7.

Figure 7. Recognition process.

How do we find the appropriate codebook for the input speech? Assume that after characteristic extraction we have the vector collection T = (x_1, x_2, x_3, ..., x_T). There are V codebooks for the vocabulary set, and the codebook of the i-th word is {y_1i, y_2i, y_3i, ..., y_mi}, where m is the codebook size. To find the appropriate codebook for the input speech, we compute the distance of the characteristic vectors to each codebook in the training collection. The distance to the i-th codebook is

    D_i = (1/T) * sum_{t=1..T} min_{1<=m<=M} d(x_t, y_mi)        (6)

From this we identify the codebook with the minimum distance D_i that satisfies the recognition threshold for the input characteristic vectors T:

    j = argmin_{1<=i<=V} D_i        (7)

The accuracy threshold is based on the minimum distance D_i in recognition mode. According to our experiments, 2.2 < D_i < 3.6 gives the best recognition results. If D_i falls outside this range, the input speech is not in the vocabulary set.

IV. RESULT

The system has been completed, comprising a hardware design and the associated software (a Nios II application) that controls the operation of the hardware and carries out the training and recognition processes.

The fixed vocabulary set consists of robot control commands (in Vietnamese): Tới (forward), Lùi (back), Trái (left), Phải (right), Nhanh (fast), Chậm (slow), Vừa (medium), Dừng (stop), and Xoay (turn).

Control signals:
    Switch SW[0]: selects recognition mode.
    Switch SW[1]: selects training mode.
    Switch SW[2]: selects either fixed training data from a text file or new training data recorded on the fly.

System states and indicators:
    Waiting state: when there is no operation, the LCD shows "Waiting" and the 7-segment LEDs display all zeros (Fig. 8).
    Recognizing/operating state: LEDs LEDR[3]-LEDR[11] indicate the recognized command used to control the robot, and the LCD displays "Recognized: <command>".

Figure 8. Waiting state.

Figure 9. Recognized "Trái" (left) command.

Figure 10. Recognized "Nhanh" (fast) command.

In Fig. 9 and Fig. 10, the system commands the robot to execute the Trái and Nhanh commands; LEDs LEDR[5] and LEDR[7] turn on in correspondence with the received command. The 7-segment LEDs show some command parameters, and the recognized command is displayed on the LCD. Experimental results are shown in Fig. 8, Fig. 9, and Fig. 10.

Because of accent differences, the system operates more accurately with a standard, common voice. With each command in the vocabulary set spoken 10 times by 4 voices not too different from the trained voice, the average accuracy is above 80%.

TABLE I. EXPERIMENT RESULTS

Trial | Tới  | Lùi    | Trái   | Phải    | Nhanh  | Chậm   | Vừa            | Dừng   | Xoay
      | (go) | (back) | (left) | (right) | (fast) | (slow) | (middle speed) | (stop) | (rotate)
1     | T    | T      | T      | T       | T      | T      | T              | T      | T
2     | T    | F      | T      | T       | T      | T      | T              | T      | T
3     | T    | T      | T      | F       | T      | F      | T              | T      | T
4     | T    | T      | T      | T       | T      | T      | F              | T      | T
5     | F    | T      | T      | T       | T      | T      | T              | T      | T
6     | T    | T      | T      | T       | T      | T      | F              | T      | T
7     | T    | T      | F      | T       | T      | F      | T              | T      | T
8     | T    | T      | T      | T       | T      | T      | T              | T      | T
9     | T    | F      | T      | F       | T      | T      | T              | T      | T
10    | T    | T      | T      | T       | T      | T      | T              | F      | T
%     | 90   | 80     | 90     | 80      | 100    | 80     | 80             | 90     | 100

V. SUMMARY

This paper has presented a speech recognition system on FPGA for the particular purpose of robot control, capable of working with Vietnamese input speech. We designed the hardware around DE2 components and wrote software embedded in the Nios II processor to control the operation of the system. The recognition method and algorithm are simple, but the experimental results show that the system is fully capable of controlling other devices (here, a robot). This opens the possibility of building integrated chips for small control systems, such as controlling robots, household devices, or cars by speech command.

REFERENCES

[1] Y. Choi, K. You, J. Choi, and W. Sung, "A real-time FPGA-based 20,000-word speech recognizer with optimized DRAM access," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 8, pp. 2119-2131, Aug. 2010.
[2] S. J. Melnikoff, S. F. Quigley, and M. J. Russell, "Speech recognition on an FPGA using discrete and continuous hidden Markov models," in Proc. 12th International Conference on Field Programmable Logic and Applications (FPL 2002).
[3] C. Y. Chang, C. F. Chen, S. T. Pan, and X. Y. Li, "The speech recognition chip implementation on FPGA," in Proc. 2nd International Conference on Mechanical and Electronics Engineering (ICMEE 2010), Kyoto, Japan, vol. 2, pp. 6-10, Aug. 2010.
[4] J. Choi, K. You, and W. Sung, "An FPGA implementation of speech recognition with weighted finite state transducers," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 1602-1605.
[5] V. Lalitha and P. Prema, "A Mel filter and cepstrum based algorithm for noise suppression in cochlear implants."
[6] B. A. Shenoi, Introduction to Digital Signal Processing and Filter Design, John Wiley & Sons, Inc., 2006, pp. 154-161.

Phan Dinh Duy was born on October 26, 1988 in Binh Dinh province, Vietnam. He obtained his B.S. degree in Computer Engineering from the University of Information Technology, where he works on circuit design and machine learning.

Vu Duc Lung received the B.S. and M.S. degrees in computer engineering from Saint Petersburg State Polytechnical University in 1998 and 2000, respectively. He received the Ph.D. degree in computer science from Saint Petersburg Electrotechnical University in 2006. Since 2006 he has worked at the University of Information Technology, VNU-HCMC as a lecturer. His research interests include machine learning, human-computer interaction, and FPGA technology. He is a member of IEEE and ACOMP 2011, and was publication chair of ICCAIS 2012.