Vocal Command Recognition Using Parallel Processing of Multiple Confidence-Weighted Algorithms in an FPGA


ECE-492/3 Senior Design Project, Spring 2015
Electrical and Computer Engineering Department, Volgenau School of Engineering
George Mason University, Fairfax, VA

Team members: Kevin Briggs, Scott Carlson, Christian Gibbons, Jason Page, Antonia Paris, and David Wernli
Faculty supervisor: Prof. Jens-Peter Kaps

Abstract

A novel system for vocal command recognition utilizing a field-programmable gate array (FPGA) was developed. An analog audio signal is processed and run through three speech recognition algorithms to determine the spoken word. The algorithms are processed in parallel, yielding greater accuracy with little loss in speed. FPGA technology is utilized because it is well suited to parallel processing with low latency and near real-time performance. Compared to software-based voice recognition methods, this system offers a reduction in both overhead and latency while improving response times. FPGAs are also less expensive than general-purpose processors and their requisite hardware. The small footprint and low cost of this vocal command recognition system make it well suited for inexpensive applications with a limited, fixed vocabulary in a variety of environments where system overhead or connectivity are of concern.

1. Introduction

Current implementations of vocal command interfaces suffer from a number of shortcomings. For example, connected systems delegate processing of speech signals to remote locations. This dependency on an external connection leaves the system vulnerable to connectivity outages and security breaches. It also introduces additional latency, rendering real-time speech recognition a challenge. The majority of disconnected systems are speaker dependent and therefore unable to interpret different speech patterns. Many disconnected vocal command interfaces are also quite expensive, making them unfeasible for cost-sensitive applications. The vocal command interface implemented in this project is designed to be disconnected, speaker independent, extensible in vocabulary, and relatively low in cost.

2. Requirements specification

The following is a list of requirements assembled through interviews with our potential users:

1. INTERFACE
1.A The system shall take vocal commands from the end user to affect an output signal.

2. INPUT
2.A The system shall compare vocal commands against a list of predefined command words.
2.B The system shall have an extensible vocabulary of no less than 10 commands.
2.C The system shall be capable of handling a command of up to 1 second in length.
2.D The system will be capable of distinguishing between low-level noise and spoken inputs.
2.E The system shall be speaker independent.

3. OUTPUT
3.A Upon a positive match, the system shall output the appropriate control signal to an external discrete device or system.
3.B Upon no match, the system shall output a control signal indicating that status.
3.C The system should respond within 100 ms of spoken command completion.
3.D The output shall follow a rigid, extensible structure for simple integration into existing control devices.

4. TECHNOLOGY
4.A The system must utilize FPGA technology in speech recognition.
4.B The system shall continuously monitor input signals without requiring any extraneous physical prompt from the user.
4.C The system shall be capable of processing the spoken command through at least two recognition algorithms simultaneously.
4.D The system shall be modular, allowing easy interfacing with existing hardware.
4.E The system shall be capable of operating in an environment with a signal-to-noise ratio greater than -10 dB.

5. WISHLIST
5.A The system should allow the end user to extend the known command list without software or hardware modification.
5.B The system should allow the end user to improve the response to a specific spoken command without modification of hardware or software.
5.C The system should be given a pre-processed signal to adjust for noisy environments (or process within the system).
5.D The system should be capable of swapping the recognition module through software updates.
5.E The system should be able to distinguish voice across a wider variety of environmental parameters, such as background noise level, speaker accent, and speaker tone.

3. System development

In the first half of this project, the team developed a working prototype of a vocal command recognition system in MATLAB. The goal of the MATLAB prototype was to understand the workings of each individual module in the system and how they would ultimately tie together in VHDL. In the second half, the project moved from the prototyping phase into the implementation phase. The challenge here was to decompose our MATLAB model into individual modules in VHDL. To translate our MATLAB prototype into VHDL for implementation on an FPGA, we first determined all the blocks necessary to execute each function of the prototype. After decomposing the prototype into functional building blocks, we determined the necessary inputs and outputs for each block. Treating them as black boxes, we created VHDL modules corresponding to each building block. We then created the necessary signals and instantiations of each module within the top-level architecture shown in Figure 1.

Figure 1: System architecture

The speech detector module is responsible for receiving data from the external ADC and determining whether the current input audio signal contains enough energy to correspond to speech.
The energy content of the speech signal is computed in real time and then passed through a running average filter. When the average energy content rises above a predefined threshold, a flag is generated that enables writing of sample data to a register, which stores incoming samples until the end of the word is detected by the energy content dropping back below the threshold. The analog front end and speech detector modules were implemented in VHDL and tested with the PMODMIC, an add-on microphone board from Digilent that is compatible with Digilent's Nexys 3 FPGA development board.
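As a rough illustration of this detection logic, the MATLAB sketch below mirrors the prototype stage described above. The window length, threshold value, and input file name are illustrative assumptions, not values from the actual design.

```matlab
% Energy-based speech detection (prototype-style sketch).
% The averaging window and threshold below are assumed values.
win_len   = 256;                      % running-average window length (samples)
threshold = 0.01;                     % assumed energy threshold

x = audioread('utterance.wav');       % hypothetical recorded input
energy = movmean(x.^2, win_len);      % running average of instantaneous energy

active = energy > threshold;          % flag set while energy exceeds threshold
word_start = find(active, 1, 'first');
word_end   = find(active, 1, 'last'); % end of word: energy drops below threshold

if ~isempty(word_start)
    word = x(word_start:word_end);    % samples written to the word register
end
```

In hardware, the same running average maps naturally onto an accumulator over a circular buffer, with the flag implemented as a simple comparator against the threshold.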

The signature extraction module is responsible for computing the unique features of interest that allow identification of, and discernment between, different sets of audio signals. One method of extracting such information from a speech signal is known as linear predictive coding (LPC). Linear predictive coding models human speech as beginning with a series of glottal pulses, or an impulse train, that forces air up the vocal tract, where the vocal tract acts as a filter on this signal. The goal of linear predictive coding is to determine a set of coefficients that model the behavior of the vocal tract during the utterance of a word. That set of coefficients can then be used to linearly predict the magnitude of the next speech sample, as in speech synthesis applications, or for matching purposes, as in the case of vocal command recognition. The general form of the LPC prediction equation is given below [1]:

    \hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k)

where the predicted sample \hat{s}(n) is a linear combination of the previous p samples weighted by the LPC coefficients a_k.

The signature extraction module generates these LPC coefficients for detected speech samples. It receives stored samples from the signature register and runs an algorithm to generate the LPC coefficients. The output is then fed to the set of matching algorithms, where it is compared against the LPC coefficients of stored command words.

To calculate the LPC values, an autocorrelation method combined with matrix multiplication was used. The autocorrelation generates the vector used to populate the matrix from which the LPC coefficients are calculated. Basic autocorrelation across the time domain is applied to the 8-bit input samples to iteratively build the autocorrelation values, yielding a vector of the desired length of 21. These values then populate the matrix used to calculate the coefficients. Solving for the LPC coefficients from this matrix directly requires a large hardware footprint if done in parallel, and many operations if done sequentially. However, because the resulting matrix is a Toeplitz matrix, the system can be solved with the Levinson-Durbin recursion, which achieves resource utilization similar to a sequential solution while cutting down the number of operations.
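A minimal MATLAB sketch of this extraction step is shown below, assuming the captured `word` samples from the detection sketch and an LPC order of 20, which matches the 21-value autocorrelation vector stated above; `levinson` requires the Signal Processing Toolbox.

```matlab
% LPC extraction via autocorrelation and Levinson-Durbin (sketch).
% 'word' is the captured utterance from the detection stage.
p = 20;                               % LPC order (21 autocorrelation lags)

% Autocorrelation values for lags 0..p.
r = zeros(p + 1, 1);
for k = 0:p
    r(k + 1) = sum(word(1:end - k) .* word(1 + k:end));
end

% The normal equations built from r have Toeplitz structure, so the
% Levinson-Durbin recursion solves them without a general inversion.
a = levinson(r, p);                   % returns [1, a(2), ..., a(p+1)]
lpc_coeffs = -a(2:end);               % prediction coefficients a_k, matching
                                      % s_hat(n) = sum a_k * s(n-k)
```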

The matching algorithms are responsible for matching incoming feature vectors generated by the signature extraction module against command word features stored in the command register. Three matching algorithms are utilized in parallel, each performing a different computation to determine the similarity between incoming speech and stored command words. The outputs of each matching algorithm are used by the sorting/weighting/ranking module to determine which command word, if any, was spoken and to generate the necessary status and control output.

The first matching algorithm (Figure 2) computes the variance of the LPC coefficients of incoming word utterances, the variance of stored command words, and the covariance of incoming word utterances and stored command words.

Figure 2: Matching algorithm #1 - Equations and data path

The second matching algorithm is designed to compute the Euclidean distance between the LPC coefficients of incoming word utterances and stored command words. The final matching algorithm computes the difference in rate of change between the LPC coefficients of the incoming word utterance and stored command words.
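The three similarity measures might be prototyped in MATLAB as below. The report names the quantities used by algorithm #1 (two variances and a covariance) but not their exact combination; folding them into a normalized correlation score is our assumption, and the function and variable names are hypothetical.

```matlab
function [s1, s2, s3] = match_scores(u, v)
% Sketch of the three matching measures for an incoming LPC vector u
% and a stored command-word vector v of the same length.
    u = u(:);  v = v(:);

    % Algorithm 1: variance/covariance based score. Combining the two
    % variances and the covariance into a normalized correlation is an
    % assumption; the report does not give the exact formula.
    C  = cov(u, v);                          % 2x2 covariance matrix
    s1 = C(1, 2) / sqrt(C(1, 1) * C(2, 2));

    % Algorithm 2: Euclidean distance between coefficient vectors.
    s2 = norm(u - v);

    % Algorithm 3: difference in rate of change (first differences).
    s3 = norm(diff(u) - diff(v));
end
```

Each score would be computed against every stored command word and passed on for sorting; note that s1 is better when larger while s2 and s3 are better when smaller, which the sorting and weighting stages must account for.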

The sorting algorithm was designed to take a minimum number of clock cycles while allowing incoming data to be placed in the sorted list without causing any delay. This method utilizes a custom-designed register structure along with a sorting algorithm loosely based on radix sort; the full execution of an insertion takes only a single clock cycle. The total time for finding the proper insertion location could be reduced by adding additional read logic to allow reading from multiple locations at a time.

The weighting module takes the ranked scores from each of the match scoring modules and creates a master list from the top 24 command matches of each. Each scoring module's best matches are given a weight equal to the square root of the rank, times 10,000; the actual weights are provided via a lookup table. The score data is read in from the sorted score register. This data is addressed by rank and contains the command number (corresponding to the master list of signature data) as well as the actual score. The score itself is discarded. The command number is then used as the address for the write portion of the module's actions, and the data written into the subsequent register is the sum of the current weight and the weight loaded from that register. Once through all of the top-ranked command list matches, the weighted words reside in the next register. The final ranking is done by reusing the sort method and register description from the score sorting module, with the addition of simple control logic to manage the generation of input and start signals.
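The weighting and final ranking step might be sketched as follows. The sqrt(rank) x 10,000 lookup follows the text; reading the accumulated weight as a penalty (rank 1 contributes the smallest weight, so the lowest total wins) is our interpretation, and the `ranked` matrix and function name are hypothetical stand-ins for the sorted score registers.

```matlab
function best_cmd = weighted_rank(ranked, num_cmds)
% Weighting/ranking sketch. 'ranked' is a top_n-by-3 matrix whose entry
% (r, m) is the command number placed at rank r by matching module m;
% 'num_cmds' is the vocabulary size.
    top_n = size(ranked, 1);                    % e.g. the top 24 matches
    weight_lut = round(sqrt(1:top_n) * 10000);  % precomputed lookup table

    totals = zeros(num_cmds, 1);                % per-command weight register
    for m = 1:size(ranked, 2)                   % each matching module
        for r = 1:top_n                         % each ranked entry
            cmd = ranked(r, m);                 % command number at this rank
            totals(cmd) = totals(cmd) + weight_lut(r);
        end
    end

    % Under the penalty reading, rank 1 adds the smallest weight, so the
    % lowest accumulated total indicates the best overall match.
    [~, order] = sort(totals, 'ascend');
    best_cmd = order(1);
end
```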

4. Experimentation plan

For the final design of our system, a testing plan was devised and broken down into three stages. Each stage aimed to address different requirements set for the system during its design.

Stage one tested the system utilizing prerecorded utterances. The specifications for this stage were as follows: a minimum of 30 different words with a minimum of 10 utterances per word, and a minimum of 10 different speakers with variation in accents and gender. The test is deemed a success if the recognition accuracy is at least 80% and the system produces output within 0.5 seconds.

Stage two was designed to test the system's noise tolerance utilizing prerecorded environmental sounds. It was designed to reflect real-world operation, in which the system is subjected to sounds that are not speech. The requirement for this stage was that the system be subjected to a minimum of ten (10) non-speech sounds. The experiment is deemed a success if the system recognizes each sound as a non-command match.

Stage three tested the system's functionality and usability utilizing live inputs. The requirements for this stage were as follows: all team members and 20 non-team members shall provide live inputs, with non-team members chosen to provide variation in gender and accent and to include at least two users with speech impediments; each user shall enter a minimum of 10 commands with at least 30 utterances of each. The test is deemed a success if command words are properly stored, recognized with at least 80% accuracy, and the correct command is executed within 0.5 seconds of the end of the spoken utterance. Feedback is collected from non-team members to gauge the ease of interacting with the system.

5. Experimental validation

Each module was tested individually and functions as expected, producing output consistent with the provided input. The outputs of the individual modules reflect those produced by the MATLAB prototype. Below are simulation waveforms of three selected modules.

Figure 3: LPC module simulation and device utilization table

Figure 4: Match algorithm 1 simulation

Figure 5: Sorting algorithm simulation

System recognition results are shown in Figure 6. Over 2500 recorded test utterances were run through 5 million individual qualitative and quantitative tests. Optimal extraction algorithm selection, matching algorithms and parameters, as well as the final weighting formulae, were determined based on these results. We observed that:

1. Weighting drastically improves averaged performance, from 53% to 77% total accuracy, and enhances robustness to noise.
2. Inclusion of the less accurate methods still increases overall system accuracy.

The system was also tested using the video game Frogger as the external device. Five command words were used, with overall (speaker-dependent) accuracy above 91%.

Figure 6: Test recognition results

A comparison of the speed of a pure software implementation against the hardware implementation on the FPGA is shown in Figure 7. As expected, the gain is very significant and demonstrates real-time performance.

Figure 7: Software vs. hardware speed

6. Conclusions

Vocal command recognition is by itself an advanced task. Combining advanced signal processing with advanced circuitry to make a device that can turn sound waves into a physical stimulus makes this task exponentially more difficult. Because of this there is large room for error; with several engineers working on different parts, synchronized design work is essential. After dissecting the individual modules, it was often found that the culprit was either device resource utilization above 100% or a maximum frequency that was too low. As in most cases with FPGAs, this is an issue of area versus speed on the device. In many cases, a simple conversion from concurrent to sequential logic, or vice versa, was required. It was also helpful to learn about the implementation of BRAM, which allowed us to reduce several modules from over 10,000% resource utilization to less than 50%, and in most cases less than 10%. In the specific case of the LPC module, where device resource utilization was over 10,000%, a new approach was required because there was no efficient way to infer the algorithm being used. The project as a whole had many more hurdles than originally anticipated, especially once the complexity of the project at hand became clear. Pre-testing was a key part of successfully completing this type of project; in this project's case, the transformation from MATLAB simulation to FPGA simulation was more intensive than originally anticipated.

References

[1] L. Rabiner, Digital Speech Processing course notes, Lecture 13, winter 2012, UCSB. http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/digital%20speech%20processing%20course/lectures_new/Lecture%2013_winter_2012_6tp.pdf
[2] P. Felber, speech recognition project report, Illinois Institute of Technology. http://www.ece.iit.edu/~pfelber/speechrecognition/report.pdf