Vocal Command Recognition Using Parallel Processing of Multiple Confidence-Weighted Algorithms in an FPGA

ECE-492/3 Senior Design Project, Spring 2015
Electrical and Computer Engineering Department, Volgenau School of Engineering
George Mason University, Fairfax, VA

Team members: Kevin Briggs, Scott Carlson, Christian Gibbons, Jason Page, Antonia Paris, and David Wernli
Faculty Supervisor: Prof. Jens Peter Kaps

Abstract

A novel system for vocal command recognition utilizing a field-programmable gate array (FPGA) chip was developed. An analog audio signal is processed and run through three speech recognition algorithms to determine the spoken word. The algorithms run in parallel, yielding greater accuracy with little loss in speed. FPGA technology is utilized because it is well suited to parallel processing, with low latency and near real-time performance. Compared to software-based voice recognition methods, this system reduces both overhead and latency while improving response times. FPGAs are also comparatively less expensive than general-purpose processors and their requisite hardware. The small footprint and low cost of this vocal command recognition system make it well suited for inexpensive applications with a limited, fixed vocabulary in a variety of environments where system overhead or connectivity are of concern.

1. Introduction

Current implementations of vocal command interfaces suffer from a number of shortcomings. For example, connected systems delegate processing of speech signals to remote locations. This dependency on an external connection leaves the system vulnerable to connectivity outages and security breaches. It also introduces additional latency, rendering real-time speech recognition a challenge. The majority of disconnected systems are speaker dependent and therefore unable to interpret different speech patterns.
Many disconnected vocal command interfaces are also quite expensive, making them unfeasible for cost-sensitive applications. The vocal command interface implemented in this project is designed to be disconnected and speaker independent, to have an extensible vocabulary, and to be relatively low in cost.
2. Requirements specification

The following is a list of requirements assembled through interviews with our potential users:

1. INTERFACE
  1.A The system shall take vocal commands from the end user to affect an output signal.
2. INPUT
  2.A The system shall compare vocal commands against a list of predefined command words.
  2.B The system shall have an extensible vocabulary of no fewer than 10 commands.
  2.C The system shall be capable of handling a command of up to 1 second in length.
  2.D The system will be capable of distinguishing between low-level noise and spoken inputs.
  2.E The system shall be speaker independent.
3. OUTPUT
  3.A Upon a positive match, the system shall output the appropriate control signal to an external discrete device or system.
  3.B Upon no match, the system shall output a control signal indicating that status.
  3.C The system should respond within 100 ms of spoken command completion.
  3.D The output shall follow a rigid, extensible structure for simple integration into existing control devices.
4. TECHNOLOGY
  4.A The system must utilize FPGA technology in speech recognition.
  4.B The system shall operate in a state of continuous monitoring of input signals without requiring any extraneous physical prompt from the user.
  4.C The system shall be capable of processing the spoken command through at least two recognition algorithms simultaneously.
  4.D The system shall be modular, allowing easy interfacing with existing hardware.
  4.E The system shall be capable of operating in an environment with a signal-to-noise ratio greater than -10 dB.
5. WISHLIST
  5.A The system should be capable of allowing the end user to extend the known command list without software or hardware modification.
  5.B The system should be capable of allowing the end user to improve the response to a specific spoken command without modification of hardware or software.
  5.C The system should be given a pre-processed signal to adjust for noisy environments (or process within the system).
  5.D The system should be capable of swapping the recognition module through software updates.
  5.E The system should have the ability to distinguish voice across a wider variety of environmental parameters, such as background noise level, speaker accent, and speaker tone.

3. System development

In the first half of this project the team developed a working prototype of a vocal command recognition system in MATLAB. The goal behind prototyping in MATLAB was to understand the workings of each individual module in the system and how they would ultimately tie together in VHDL. In the second half, the project moved from the prototyping phase into the actual implementation phase. The challenge here was to decompose our MATLAB model into individual modules in VHDL.

To translate our MATLAB prototype into VHDL for implementation on an FPGA, we first determined all the blocks necessary to execute each function of the prototype. After decomposing the prototype into functional building blocks, we determined the necessary inputs and outputs for each block. Treating them as black boxes, we created VHDL modules corresponding to each building block. We then created the necessary signals and instantiations of each module within a top-level architecture, shown in Figure 1.

The speech detector module is responsible for receiving data from the external ADC and determining whether the current input audio signal contains enough energy to correspond to speech. The energy content of the speech signal is computed in real time and then passed through a running average filter. When the average energy
content rises above a predefined threshold, a flag is generated that enables writing of sample data to a register that stores incoming samples until the end of the word is detected by the energy content dropping back below the threshold. The analog front end and speech detector modules were implemented in VHDL and tested with the PmodMIC (an add-on board provided by Digilent that is compatible with Digilent's Nexys 3 FPGA development board).

Figure 1: System architecture

The signature extraction module is responsible for computing the unique features of interest that allow identification of and discernment between different sets of audio signals. One method of extracting such information from a speech signal is known as LPC, or linear predictive coding. Linear predictive coding models human speech as beginning with a series of glottal pulses, or an impulse train, that forces air up the vocal tract, where the vocal tract acts as a filter on this signal. The goal of linear predictive coding is to determine a set of coefficients that model the behavior of the vocal tract during the utterance of a word. That set of coefficients can then be used to linearly predict the magnitude of the next speech sample, as in speech synthesis applications, and it can also be used for matching purposes, as in the case of vocal command recognition. The general form of the LPC prediction equation, in which the next sample is estimated as a weighted sum of the previous p samples, is [1]:

    s~(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p)

The signature extraction module generates these LPC coefficients for detected speech samples. It receives stored samples from the signature register and runs an algorithm to generate LPC coefficients. The output is then fed to the set of matching algorithms, where it is compared against the LPC coefficients of stored command words. To calculate the LPC values, an autocorrelation method combined with matrix multiplication was used.
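As a point of reference for the hardware design, the autocorrelation and Levinson-Durbin computation can be sketched in floating point. The predictor order of 20 (giving a 21-entry autocorrelation vector, as in the text) is taken from the design; everything else is a generic sketch standing in for the MATLAB prototype, not the fixed-point VHDL implementation:

```python
import numpy as np

def lpc_coefficients(samples, order=20):
    """Sketch of LPC analysis: time-domain autocorrelation followed by
    the Levinson-Durbin recursion, which exploits the Toeplitz structure
    of the autocorrelation matrix to solve for the coefficients in
    O(order^2) operations instead of a general matrix solve."""
    x = np.asarray(samples, dtype=np.float64)
    n = len(x)
    # Autocorrelation values r[0..order].
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= 1.0 - k * k                   # prediction error update
    return a  # A(z) coefficients with a[0] == 1; prediction uses -a[1:]

# Demo: recover the coefficients of a known 2nd-order autoregressive
# signal x[n] = 0.75*x[n-1] - 0.5*x[n-2] + e[n].
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros(20000)
for n in range(2, 20000):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + e[n]
a = lpc_coefficients(x, order=2)  # expect a[1] near -0.75, a[2] near +0.5
```

The demo illustrates why the coefficients are usable as a signature: they recover the filter that shaped the signal, which for speech corresponds to the vocal tract.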
The autocorrelation is responsible for generating the vector used to populate the matrix from which the LPC coefficients are calculated. Basic autocorrelation across the time domain is applied to each 8-bit input sample to iteratively build the autocorrelation values. These values form a vector of our desired length of 21, which is then used to populate the matrix from which the coefficients are calculated. Solving for the LPC coefficients from this matrix directly incurs a large footprint if conducted in parallel, and many operations if conducted sequentially. However, because the resulting matrix is a Toeplitz matrix, it can be solved using the Levinson-Durbin recursion, which has resource utilization similar to that of a sequential solution while cutting down on the number of operations.

The matching algorithms are responsible for matching incoming feature vectors generated by the signature extraction module against command word features stored in the command register. Three matching algorithms are utilized in parallel, each applying a different method to determine the similarity between incoming speech and stored command words. The outputs of each matching algorithm are utilized in the sorting/weight/rank
module to determine which, if any, command word was spoken and to generate the necessary status and control output.

The first matching algorithm (Figure 2) computes the variance in LPC coefficients of incoming word utterances, the variance of stored command words, and the covariance of incoming word utterances and stored command words.

Figure 2: Matching algorithm #1 - Equations and data path

The second matching algorithm is designed to compute the Euclidean distance between LPC coefficients of incoming word utterances and stored command words. The final matching algorithm computes the difference in the rate of change between LPC coefficients of the incoming word utterance and stored command words.

The sorting algorithm was designed to take a minimum number of clock cycles while allowing incoming data to be placed in the sorted list without causing any delay. This method utilizes a custom-designed register structure along with a sorting algorithm loosely based on the radix sort algorithm. The full execution of this insertion then
takes only a single clock cycle. The total time for finding the proper insertion location could be reduced by adding additional read logic to allow reading from multiple locations at a time.

The weighting module takes the ranked scores from each of the match scoring modules and creates the master list of the top 24 command matches from each. Each scoring module's best matches are given a weight equal to the square root of the rank, times 10,000. The actual weights are provided via a lookup table. The score data is read in from the sorted score register. This data is addressed by rank and contains the command number (corresponding to the master list of signature data) as well as the actual score. The score data is discarded. The command number is then utilized as the address for the write portion of the module's actions, and the data written into the subsequent register is the sum of the current weight and the loaded weight from that register. Once through all of the top-ranked command list matches, the weighted words reside in the next register. The final ranking is done through a re-use of the sort method and register description from the score sorting module, with the addition of simple control logic to manage the generation of input and start signals.

4. Experimentation plan

For the final design of our system, a testing plan was devised and broken down into three stages. Each stage aimed to address different requirements set for the system during the design of this vocal command recognition system.

Stage one of the experimentation plan tested the system utilizing prerecorded utterances. The specifications for this stage were as follows:

- A minimum of 30 different words with a minimum of 10 utterances per word.
- A minimum of 10 different speakers with a variation in accents and gender.
- The test is deemed a success if the recognition accuracy is at least 80% and the system output takes 0.5 seconds or less.
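Before moving to the remaining test stages, the weighting scheme from the previous section can be illustrated in miniature. The sqrt(rank) times 10,000 weight follows the text; treating the weight as a penalty (lowest accumulated total wins) and tallying only commands that appear in every ranked list are assumptions made so this sketch is self-contained:

```python
import math

def fuse_rankings(ranked_lists, top_n=24):
    """Combine per-algorithm ranked lists (best match first) into one
    overall ranking.  Rank r (1-based) contributes sqrt(r) * 10000, as
    in the hardware lookup table; the lowest-total-wins convention and
    the restriction to commands present in every list are assumptions
    of this sketch, not details taken from the design."""
    lists = [list(r[:top_n]) for r in ranked_lists]
    common = set(lists[0]).intersection(*map(set, lists[1:]))
    totals = {cmd: 0.0 for cmd in common}
    for ranked in lists:
        for rank, cmd in enumerate(ranked, start=1):
            if cmd in totals:
                totals[cmd] += math.sqrt(rank) * 10000
    # Ascending accumulated weight: best overall match first.
    return sorted(totals, key=totals.get)

# Demo with three hypothetical matcher outputs over commands 1-3:
# command 3 is ranked first by two of the three matchers.
order = fuse_rankings([[3, 1, 2], [3, 2, 1], [1, 3, 2]])
```

In the demo, command 3 accumulates the smallest penalty because two of the three matchers rank it first, mirroring the report's observation that combining the algorithms improves on any single one.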
Stage two of the experimentation plan was designed to test the system's noise tolerance utilizing prerecorded environmental sounds. It was designed to reflect real-world operation, in which the system is subjected to sounds that are not speech. The requirement of this stage was that the system be subjected to a minimum of ten (10) sounds that were not human speech. The experiment was deemed a success if the system recognized each sound as a non-command match.

Stage three of the experimentation plan tested the system's functionality and usability utilizing live inputs. The requirements for this stage were as follows:

- All team members and 20 non-team members shall provide live inputs.
- Non-team members are included to provide variation in gender and accent, including at least two users with speech impediments.
- Each user shall enter a minimum of 10 commands with at least 30 utterances of each.
- Success is defined as command words being properly stored, recognized with at least 80% accuracy, and the correct command executed within 0.5 seconds of the end of the spoken utterance.
- Feedback is collected from non-team members to determine the ease of interacting with the system.

5. Experimental validation

Each module has been tested individually and functions as expected, producing output consistent with the input provided. The outputs of the individual modules reflect those produced by the MATLAB prototype. Below are simulation waveforms of three selected modules.
Figure 3: LPC module simulation and device utilization table

Figure 4: Match algorithm 1 simulation

Figure 5: Sorting algorithm simulation
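For comparison with the match-algorithm waveforms above, the three similarity measures can be sketched in software. The exact combination of variances and covariance in algorithm #1 (Figure 2) is not reproduced here; a normalized correlation coefficient is used in its place as an assumption:

```python
import numpy as np

def correlation_score(x, y):
    # Algorithm 1 (sketch): variances and covariance combined into a
    # normalized correlation coefficient; higher means more similar.
    return np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

def euclidean_distance(x, y):
    # Algorithm 2: Euclidean distance between the LPC coefficient
    # vectors; lower means more similar.
    return np.linalg.norm(x - y)

def rate_of_change_distance(x, y):
    # Algorithm 3: distance between the first differences (rate of
    # change) of the two LPC coefficient vectors; lower is better.
    return np.linalg.norm(np.diff(x) - np.diff(y))

# Demo on two hypothetical LPC coefficient vectors.
u = np.array([1.0, -0.6, 0.3, -0.1])
v = np.array([1.0, -0.5, 0.2, -0.1])
score = correlation_score(u, v)
dist = euclidean_distance(u, v)
roc = rate_of_change_distance(u, v)
```

Because the three measures disagree in scale and direction, their raw outputs cannot be summed directly, which is why the sorting and weighting stages operate on ranks rather than on the scores themselves.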
System recognition results are shown in Figure 6. Over 2,500 recorded test utterances were run through 5 million individual qualitative and quantitative tests. The optimal extraction algorithm, match algorithms and parameters, as well as the final weighting formulae, were determined based on these results. We noticed that:

1. Weighting drastically improves averaged performance, from 53% to 77% total accuracy, and enhances robustness to noise.
2. Inclusion of the less accurate methods still increases overall system accuracy.

The system was also tested using the video game Frogger as the external device. Five command words were used, with overall (speaker-dependent) accuracy above 91%.

Figure 6: Test recognition results

Figure 7: Software vs. hardware speed

A comparison of the gain in speed between the pure software implementation and the hardware implementation using FPGAs is shown in Figure 7. As expected, the gain is very significant and demonstrates real-time performance.

6. Conclusions

Vocal command recognition by itself is an advanced task. Combining advanced signal processing with advanced circuitry to make a device that can turn sound waves into a physical stimulus makes the task substantially more difficult and leaves large room for error; with several engineers working on different parts, synchronized design is essential. After dissecting the individual modules, it was often found that either the device resource utilization was over 100% or the maximum frequency was too low. As in most cases with FPGAs, this is an issue of area vs. speed on the device. In many cases, a simple conversion from concurrent to sequential logic was required, or vice versa. It was also helpful to learn about the inference of block RAM (BRAM), allowing us to reduce several modules from over 10,000% resource utilization to less than 50%, and in most cases less than 10%.
In a specific case in the LPC module, where the device resource utilization was over 10,000%, a new approach to the module was required because there was no efficient way to infer the algorithm being used. The project as a whole had many more hurdles than originally anticipated, especially once the complexity of the project at hand was realized. Pretesting was a key part of successfully completing this type of project. In this project's case, the transformation from MATLAB simulation to FPGA implementation was more intensive than originally anticipated.

References

[1] http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/digital%20speech%20processing%20course/lectures_new/Lecture%2013_winter_2012_6tp.pdf
[2] http://www.ece.iit.edu/~pfelber/speechrecognition/report.pdf