A DEVICE FOR AUTOMATIC SPEECH RECOGNITION*

Mats Blomberg and Kjell Elenius

INTRODUCTION

In the following, a device for automatic recognition of isolated words will be described. It was developed at the Department of Speech Communication and Music Acoustics at KTH and was ready by April 1982. Research in this area has been pursued since the beginning of the seventies. The present system has earlier been implemented on minicomputers - first a Control Data Corporation 1700 and later a Data General Eclipse S-140 (Elenius & Blomberg, 1982). The current compact system has been made possible by the rapid development in the area of integrated circuits. The most important parts of the system are a 16-bit microprocessor and a single-chip digital signal processor.

Automatic speech recognition is, among other things, useful in situations where an operator is inputting data to a computer in parallel with using his hands for other tasks, e.g., when sorting goods or inspecting products. Another application is the voice control of a word processor while inputting the text manually.

The recognition strategy used can briefly be described as the extraction of a number of speech parameters from the acoustic speech signal for each word. The speech parameters describe the word by their variation over time, and together they build up a pattern that characterises the word. In a training phase the operator reads all the words of the vocabulary of the current application. The word patterns are stored, and later, when a word is to be recognised, its pattern is compared to the stored patterns and the word that gives the best correspondence is selected. This technique is generally referred to as pattern recognition.

BRIEF DESCRIPTION OF THE HARDWARE BOARDS

The hardware of the word recogniser consists of a general microcomputer and a signal processor for the acoustic analysis of the speech signal. The microcomputer board contains the Motorola MC-68000 microprocessor and also has facilities for the input and output of data as well as memory-managing circuits for the memory cards. The size of the memory boards is currently 32 kbyte - RAM and ROM - per board, but they are prepared for different sizes. The speech analysis board implements a spectrum analyser in the form of a 16-channel filter bank. It consists of a microphone amplifier, an anti-aliasing filter, a CODEC for the A/D-conversion of the speech signal and an NEC-7720P signal processor. A minimal system has three Europe boards and can recognise 24 words. Adding more memory boards will increase the maximum vocabulary.

* This paper is a translation of a paper originally published in the proceedings of the 1982 meeting of "Nordiska akustiska sällskapet" (The Nordic Acoustical Society), pp. 383-386.

DESCRIPTION OF THE RECOGNITION SYSTEM

A block diagram of the system can be seen in Figure 1. The speech signal is low-pass filtered by an anti-aliasing filter and fed to the CODEC, where it is sampled at 10 kHz and digitally converted before it is taken care of by the signal processor, which functions as a filter bank. The filter spacing follows the theory of critical bands, which is based on the frequency resolution of the ear (Elenius, 1980). Sixteen band-pass filters of the fourth order are calculated between 200 Hz and 5 kHz. The maximal amplitude within each band is measured during a 25 ms interval. The amplitudes are then fed to the microprocessor, which thus samples a spectral section of 16 amplitudes every 25 ms (40 Hz).

To reduce the data, a cepstrum (cosine) transform is calculated from the filter amplitudes according to the following formula (Davis & Mermelstein, 1980):

    C_j = \sum_{k=1}^{16} A_k \cos( j (k - 0.5) \pi / 16 ),   j = 1, 2, ..., 6

where A_k is the amplitude of filter k and C_j is cepstrum coefficient j. The term -0.5 is included to make filter number 8 influence the calculations. We now have 6 parameters describing the spectrum at each sample point of 25 ms. Since the zero-th coefficient is not calculated, the parameters are in principle independent of the signal amplitude.

Figure 1. Block diagram of the recognition system: speech signal -> anti-aliasing filter -> CODEC A/D converter (10 kHz sample rate) -> NEC 7720P filter bank of 16 band-pass filters, 200-5000 Hz (40 Hz sample rate) -> MC 68000: calculation of cepstral coefficients, detection of word endpoints, linear time normalisation, then either training (reference templates for the used vocabulary) or recognition (distance calculation using dynamic programming, selection of the best match). Dashed lines mark boundaries between different hardware components.
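The front-end analysis just described is straightforward to prototype in software. The sketch below is a minimal illustration, not the original NEC-7720P firmware: the Butterworth filter design and the exact critical-band edge frequencies are assumptions (the paper does not list the band edges), while the 25 ms maximum-amplitude measurement and the cepstrum transform follow the description and formula above.

```python
# Minimal sketch of the filter-bank front end and cepstrum transform.
# Assumptions: Butterworth band-pass filters, illustrative Bark-like band edges.
import numpy as np
from scipy.signal import butter, sosfilt

FS = 10_000                      # 10 kHz sampling rate
FRAME = int(0.025 * FS)          # 25 ms analysis interval (40 Hz frame rate)

# 17 approximate critical-band edges between 200 Hz and the Nyquist limit,
# giving 16 bands (assumed values; the paper states only 200 Hz - 5 kHz).
EDGES = [200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
         1720, 2000, 2320, 2700, 3150, 3700, 4400]

def filterbank_frames(speech):
    """Return one spectral section of 16 amplitudes per 25 ms frame."""
    n_frames = len(speech) // FRAME
    amps = np.empty((n_frames, 16))
    for k in range(16):
        # A second-order prototype yields a fourth-order band-pass filter.
        sos = butter(2, [EDGES[k], EDGES[k + 1]], btype="bandpass",
                     fs=FS, output="sos")
        band = sosfilt(sos, speech)
        for t in range(n_frames):
            # Maximal amplitude within the band during the 25 ms interval.
            amps[t, k] = np.abs(band[t * FRAME:(t + 1) * FRAME]).max()
    return amps

def cepstrum_coefficients(amps, n_coeff=6):
    """C_j = sum_{k=1..16} A_k cos(j (k - 0.5) pi / 16), j = 1..6."""
    k = np.arange(1, 17)                      # filter index 1..16
    return np.array([np.sum(amps * np.cos(j * (k - 0.5) * np.pi / 16))
                     for j in range(1, n_coeff + 1)])
```

Each 25 ms section returned by filterbank_frames would then be reduced to six parameters by cepstrum_coefficients, giving one 6-element vector per sample point.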
To decide when a word starts and ends, the system is mainly sensitive to the speech signal amplitude in a low-pass band. When this amplitude goes above a given threshold level the word has started, and when the amplitude goes below it the word has ended. Short intervals, on the order of 200-300 ms, outside these endpoints are also included in the word. The word is then normalised to a fixed length of 32 sample points. Even though a linear normalisation is not enough to take care of all the time variation, it gives a good first approximation. It also gives all words the same nominal length, which facilitates the subsequent calculations.
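As an illustration of the endpoint detection and linear length normalisation, here is a rough sketch. The threshold value, the 250 ms margin (the paper gives a 200-300 ms range) and the nearest-frame resampling are assumptions.

```python
# Sketch of endpoint detection and linear time normalisation to 32 points.
import numpy as np

def detect_endpoints(lowband_amp, threshold, margin_frames=10):
    """Find first/last frame where the low-pass band amplitude exceeds the
    threshold, then widen by a short margin (10 frames = 250 ms at 40 Hz)."""
    above = np.flatnonzero(lowband_amp > threshold)
    if above.size == 0:
        return None                                  # no word detected
    start = max(above[0] - margin_frames, 0)
    end = min(above[-1] + margin_frames, len(lowband_amp) - 1)
    return start, end

def normalise_length(frames, n_out=32):
    """Linearly resample an (n_frames, 6) cepstral pattern to 32 sample points."""
    positions = np.linspace(0, len(frames) - 1, n_out)
    idx = np.round(positions).astype(int)            # nearest-frame resampling
    return frames[idx]
```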
Figure 2. Time mapping for pattern recognition: unnormalised, linearly normalised, and non-linearly normalised mappings between the reference (REF) and the stimulus (STI). The stimulus word is mapped onto a shorter reference word. Without normalisation, a phoneme of the stimulus will be mapped onto the wrong phoneme of the reference. A linear mapping is substantially better, even though some time differences remain. The non-linear mapping is achieved by dynamic programming.

The system first has to learn the speech parameter patterns for the words of the user vocabulary. This is done in a training phase, in which all words are repeated a number of times and a reference template for each word is calculated as a mean value over its repetitions. The templates are then stored and are specific for each user.

In the recognition phase the unknown word is compared to all reference templates and a distance is calculated to each of them. In these calculations a certain amount of non-linearity is allowed in the mapping of the test word onto the reference words, since the linear time normalisation already performed will frequently not be sufficient. The dynamic programming algorithm used allows for locally stretching and shrinking the speech signal to between 50% and 200% and is described in the Appendix. It increases the number of local distance calculations between test and reference patterns, but it results in a better mapping of the time relations. After measuring the distance to all the references, the word that gives the smallest distance is selected. To avoid responses to unintentional sounds and background noise, a rejection threshold is used: if the distances to all words are above the given threshold, the input is rejected.
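A compact sketch of this training and recognition logic is given below, assuming the dtw_distance function outlined in the Appendix sketch at the end of this paper; the function and variable names are illustrative, not from the original system.

```python
# Sketch of template training (mean over repetitions) and recognition with
# a rejection threshold; `dtw_distance` is defined in the Appendix sketch.
import numpy as np

def train_template(repetitions):
    """Average several normalised (32, 6) patterns into one reference template."""
    return np.mean(np.stack(repetitions), axis=0)

def recognise(pattern, templates, reject_threshold):
    """Return the best-matching word, or None if every distance is too large."""
    distances = {word: dtw_distance(pattern, ref)
                 for word, ref in templates.items()}
    best = min(distances, key=distances.get)
    return best if distances[best] <= reject_threshold else None
```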

PRELIMINARY RESULTS

The system has been preliminarily tested using one speaker and two different vocabularies - one including all Swedish consonants and one including all vowels. The consonants were read in an A-context, e.g., ADA, APPA, AFFA. The vocabulary was read 10 times in a normal tempo to train the system. During the recognition test the same words were used but spoken in seven different ways: normal, fast, lax, emphatic, and slow, plus normal with a different angle to the microphone and normal with an increased distance to the microphone. The vowel test was done in a similar way, with the vowels in an H-L context, e.g., HAL, HOL. In this experiment the lax and emphatic versions were not tested. The vocabularies may be considered difficult - it is not very easy to discriminate between APPA and ATTA, AMMA and ANNA, or HIL and HYL.

We have done the same test with three commercial American word recognition systems. None of them used the dynamic programming technique. Their recognition scores were 55%-75% for the consonants and 85%-90% for the vowels. The corresponding scores for our system were 86% and 95%.

REFERENCES

S. Davis & P. Mermelstein (1980): "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28, No. 4, pp. 357-366.

K. Elenius (1980): "Long time average spectrum using a 1/3 octave filter bank," STL-QPSR 4/1980, pp. 15-19.

K. Elenius & M. Blomberg (1982): "Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system," IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris (ICASSP 82), May 3-5, pp. 535-538.

H. Sakoe & S. Chiba (1978): "Dynamic programming algorithm optimisation for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-26, pp. 43-49.

APPENDIX

The algorithm used for the dynamic programming is described in Sakoe & Chiba (1978). Here we give a short summary of the method. The linear time normalisation results in the same length (number of sample points) for all word patterns. We assume two word patterns A and B, each represented by a sequence of cepstrum vectors at the sample points 1 to N (N = 32):

    A = a_1, a_2, ..., a_N
    B = b_1, b_2, ..., b_N

The distance d(i,j) between vectors a_i and b_j is:

    d(i,j) = \sum_{p=1}^{P} | a_{i,p} - b_{j,p} |

where P = 6 is the number of cepstral coefficients used. The accumulated distance g(i,j) from the beginning up to a time position (i,j) between patterns A and B is recursively calculated according to:

    g(i,j) = \min \{ g(i-1, j-2) + 2 d(i, j-1) + d(i,j),
                     g(i-1, j-1) + 2 d(i,j),
                     g(i-2, j-1) + 2 d(i-1, j) + d(i,j) \}

Moreover, we have the boundary condition that the first and the last points of patterns A and B must be mapped onto each other. The overall distance between the patterns is then:

    D(A,B) = g(N,N)

To reduce the number of distance calculations, a time window r is introduced:

    | i - j | <= r

It defines the maximally allowed deviation from a linear time normalisation, which is given by the line i = j in our case. A reasonable value of r is 3 sample points. The time for matching two patterns is then 50 ms, while the time for a linear time normalisation is only 4 ms.
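The recursion and window above translate directly into code. The following sketch follows the symmetric Sakoe & Chiba (1978) form summarised here; treating out-of-window cells as unreachable (infinite cost) and the starting weight g(1,1) = 2 d(1,1) are conventional choices of that form, and the indices are 0-based where the paper counts 1 to N.

```python
# Sketch of the dynamic programming match with the |i - j| <= r window.
import numpy as np

def dtw_distance(a, b, r=3):
    """Accumulated distance g(N, N) between two (N, P) cepstral patterns."""
    n = len(a)
    # Local city-block distances d(i, j) between all frame pairs.
    d = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)
    g = np.full((n, n), np.inf)
    g[0, 0] = 2 * d[0, 0]              # boundary condition: first points mapped
    for i in range(n):
        for j in range(max(0, i - r), min(n, i + r + 1)):   # time window
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i >= 1 and j >= 2:
                best = min(best, g[i - 1, j - 2] + 2 * d[i, j - 1] + d[i, j])
            if i >= 1 and j >= 1:
                best = min(best, g[i - 1, j - 1] + 2 * d[i, j])
            if i >= 2 and j >= 1:
                best = min(best, g[i - 2, j - 1] + 2 * d[i - 1, j] + d[i, j])
            g[i, j] = best
    return g[n - 1, n - 1]             # D(A, B) = g(N, N)
```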