

CMU-CS-78-123

PERFORMANCE OF HARPY SPEECH RECOGNITION SYSTEM FOR SPEECH INPUT WITH QUANTIZATION NOISE

B. Yegnanarayana and D. Raj Reddy
Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

May 1978

This work was supported by the Defense Advanced Research Projects Agency under contract F44620-73-C-0074 and is monitored by the Air Force Office of Scientific Research.

ABSTRACT

One of the major problems of a speech processing system is the degradation in performance it suffers due to distortions in the speech input. One such distortion is caused by the quantization noise of waveform encoding schemes, which have several attractive features for speech transmission. The objective of this study is to evaluate the performance of the HARPY continuous speech recognition system when the speech input to the system is corrupted by the quantization noise of an ADPCM (Adaptive Differential Pulse Code Modulation) system. The effect of quantization noise on the segmentation and on the estimation of LPC (Linear Prediction Coefficient) based parameters is studied for bit rates of the ADPCM system in the range 20-50 kbps, and the overall word and sentence recognition accuracies are evaluated. The results indicate that even 2-bit ADPCM (corresponding to 20 kbps) speech does not cause significant degradation in performance. The results are explained on the basis of the changes produced by the quantization noise in spectral shape and LPC distance.

I. INTRODUCTION

Waveform encoding techniques are generally adopted for efficient transmission of speech information over digital channels. In these cases the signal is corrupted with the quantization noise introduced by the coding scheme. Although many low bit rate schemes have been found to yield perceptually acceptable speech [1], the effect of the accompanying quantization distortion on the performance of speech processing systems, such as speech and speaker recognition systems, has not been reported. The objective of this paper is to investigate this problem.

The speech processing system considered for investigation is the Harpy [2] continuous speech recognition system developed at Carnegie-Mellon University. We consider the model of the Harpy system designed for a 1011-word AI abstract retrieval task. In this system the syntactic, lexical and word juncture knowledge are combined into one integral network representation. The network consists of a set of states and inter-state pointers. Each state has associated with it phonetic, lexical and duration information. The pointers indicate which states may follow a given state. The initial and final states indicate the beginning and ending points of all utterances, respectively. The network is thus a complete (and pre-compiled) representation of all possible pronunciations of all possible utterances in the task language. The recognition process is based on the locus model of search, in which all but a narrow beam of paths around the most likely path through the network are rejected. A sketch of this kind of search is given below.
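As a rough illustration only (not the report's actual implementation), the following Python sketch shows a beam search of this kind over a precompiled state network. Here `network`, `distance`, and `beam_width` are hypothetical stand-ins for Harpy's network structure, its LPC template match, and its pruning rule; Harpy actually prunes relative to the best path rather than keeping a fixed number of hypotheses.

    # Minimal beam-search sketch over a precompiled state network.
    # network:    dict mapping each state to the list of states that may follow it
    # segments:   the sequence of acoustic segments of the utterance
    # distance:   hypothetical match cost between a state's template and a segment
    # beam_width: number of hypotheses kept alive after each segment
    def beam_search(network, segments, initial, finals, distance, beam_width):
        beam = [(0.0, initial)]                # (cumulative distance, state)
        for seg in segments:
            candidates = []
            for d, state in beam:
                for nxt in network[state]:     # extend every surviving path
                    candidates.append((d + distance(nxt, seg), nxt))
            candidates.sort(key=lambda c: c[0])
            beam = candidates[:beam_width]     # keep only the narrow beam
        finished = [c for c in beam if c[1] in finals]
        return min(finished, key=lambda c: c[0]) if finished else None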

The recognition process in the Harpy system is as follows. Speech data is sampled at 10 kHz and digitized to 9 bits/sample. The sampled data is segmented into acoustically similar sound units based on analyses performed on successive 10 msec segments using the Itakura distance metric [3]. A more recent version of the system incorporates a segmentation procedure based on ZAPDASH (Zero-crossings And Peaks of Differenced And SmootHed waveform) parameters [4], which reduces the computation time for segmentation. Autocorrelation and linear prediction coefficients (LPC) are extracted from the center 10 msec portion of each segment. The segments are then mapped to the network states based on a distance match between the LPC data of the segments and of stored templates [3]. The mapping scheme used is a modified graph search in which heuristics are used to reduce the number of paths that are checked.

As can be seen, any distortion in the input speech can affect several stages in the recognition process. The segmentation procedures are likely to produce segment boundaries for an utterance different from those in the undistorted speech. The parameters extracted from the segments will also be different, and hence a set of templates different from the original ones will be produced. Finally, the distances used for labeling may also be affected, causing difficulty in matching the segments to the proper network states.

II. GENERATION OF DISTORTED SPEECH DATA

To study the above mentioned effects on the overall recognition performance of the Harpy system we consider the quantization noise produced in an ADPCM scheme. The distorted speech is generated as shown in Fig. 1. The scheme uses feedback adaptive quantization and a time-invariant first-order predictor. Variance-adaptive quantization is provided by observing the statistics of the quantizer output and specifying a corresponding optimum step size $\Delta_{\mathrm{opt}}$. The variance is computed over 64 samples. The following equations define the differential coding [1]:

$E_n = x_n - a_1\, x_{(n-1)q}$

$x_{nq} = a_1\, x_{(n-1)q} + E_{nq}$

$\Delta_{\mathrm{opt}} = K_{\mathrm{opt}} \left[ \frac{1}{N-1} \sum_{n=2}^{N} \left( x_{nq} - a_1\, x_{(n-1)q} \right)^2 \right]^{1/2}$

where $x_n$ are the input speech samples, $E_n$ the prediction error samples, $x_{nq}$ the quantized speech samples, $E_{nq}$ the quantized error samples, and $B$ the number of bits per sample. Here $a_1 = 0.875$, the variance is computed over $N = 64$ samples, and $K_{\mathrm{opt}}$ for different values of $B$ is as shown in Table 1.

Table 1. Design values for the ADPCM scheme (from Ref. 5; sampling frequency = 10 kHz).

    Bits per sample B     2      3      4      5
    Bit rate (kbps)       20     30     40     50
    K_opt                 0.996  0.586  0.335  0.225

III. RECOGNITION ON HARPY SYSTEM

Speech data consisting of 55 sentences for training and 20 sentences for testing was recorded using a close-speaking microphone. The signal was sampled at a 10 kHz sampling rate after passing through a pre-filter (85-4500 Hz). The samples were digitized and stored as 9 bits per sample. ADPCM speech data was generated from these stored samples for the four cases listed in Table 1.
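For concreteness, here is a minimal sketch of an ADPCM coder following the Section II equations: first-order predictor with $a_1 = 0.875$, the $K_{\mathrm{opt}}$ constants of Table 1, and the step size re-estimated from each 64-sample block of quantized errors. The mid-rise uniform quantizer, the zero initial predictor state, and the function and variable names are assumptions for illustration, not details given in the report.

    import numpy as np

    A1 = 0.875                                        # predictor coefficient a_1
    K_OPT = {2: 0.996, 3: 0.586, 4: 0.335, 5: 0.225}  # Table 1 design constants
    BLOCK = 64                                        # samples per variance estimate

    def adpcm_code(x, bits):
        levels = 2 ** bits
        k = K_OPT[bits]
        delta = k * max(float(np.std(x[:BLOCK])), 1e-6)  # initial step size
        x_prev_q = 0.0                                   # x_(n-1)q
        y = np.zeros(len(x))                             # reconstructed speech x_nq
        e_hist = np.zeros(BLOCK)                         # quantized errors of block
        for n in range(len(x)):
            e = x[n] - A1 * x_prev_q                     # prediction error E_n
            # B-bit uniform mid-rise quantizer with step size delta
            q = int(np.clip(np.floor(e / delta), -levels // 2, levels // 2 - 1))
            e_q = (q + 0.5) * delta                      # quantized error E_nq
            x_prev_q = A1 * x_prev_q + e_q               # x_nq = a1 x_(n-1)q + E_nq
            y[n] = x_prev_q
            e_hist[n % BLOCK] = e_q
            if (n + 1) % BLOCK == 0:                     # feedback adaptation
                delta = k * max(float(np.sqrt(np.mean(e_hist ** 2))), 1e-6)
        return y

With 10 kHz input samples, `adpcm_code(x, 2)` would correspond to the 20 kbps case of Table 1, and so on for B = 3, 4, 5.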

The phone templates for the ADPCM data are generated as follows. The Harpy system is run in a forced recognition mode with a previously generated set of templates for the undistorted speech data. This produces a parsing of the phones to acoustic data. The parsings are used to locate the autocorrelation data for averaging to generate templates for each phone. The averaged templates are tuned further by rejecting the autocorrelation sets that do not fall within ±1.2σ (σ is the standard deviation) of the average, and computing the average of the remaining sets. The Harpy system is finally run for recognition of both the training and test data sets. The recognition scores were obtained for both the original and ADPCM data using their respective tuned templates. The overall recognition results are summarized in Tables 2 and 3; a sketch of the tuning step follows Table 2.

TABLE 2. RECOGNITION RESULTS FOR ORIGINAL AND ADPCM (B=2) DATA

    data                  word recognition    sentence recognition
    ORIGINAL  Training    98.2 (112/114)      90.5 (19/21)
                          94.0 (189/201)      88.2 (30/34)
              Test        92.2 (71/77)        90.0 (18/20)
    ADPCM     Training    95.6 (109/114)      81.0 (17/21)
                          96.0 (193/201)      91.2 (31/34)
              Test        97.4 (75/77)        95.0 (19/20)
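As referenced above, the template tuning step can be sketched as follows. This is an illustration under stated assumptions: the report does not specify the distance used for the ±1.2σ test, so a Euclidean distance over the autocorrelation vectors stands in for Harpy's LPC distance, and `tune_template` is a hypothetical name.

    import numpy as np

    def tune_template(autocorr_sets):
        sets = np.asarray(autocorr_sets, dtype=float)
        mean = sets.mean(axis=0)                 # first-pass average template
        d = np.linalg.norm(sets - mean, axis=1)  # distance of each set to the average
        keep = d <= d.mean() + 1.2 * d.std()     # reject sets beyond +1.2 sigma
        return sets[keep].mean(axis=0) if keep.any() else mean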

TABLE 3. RECOGNITION RESULTS ON TEST DATA

    data          word recognition    sentence recognition
    original      92.2 (71/77)        90.0 (18/20)
    ADPCM B=2     97.4 (75/77)        95.0 (19/20)
    ADPCM B=3     97.4 (75/77)        95.0 (19/20)
    ADPCM B=4     94.8 (73/77)        90.0 (18/20)

IV. RESULTS AND DISCUSSION

The results in Tables 2 and 3 show that the Harpy system performs equally well for ADPCM speech, even with 2-bit coding. This may be due to the fact that the system tunes the templates for each kind of data. Moreover, the system uses several sources of knowledge and heuristics to take care of sources of variability such as speaker, noise and distortion. However, if higher accuracy or larger vocabulary systems are built using the finer details of the acoustic data, then the recognition accuracy with distorted speech may not match the performance with the undistorted data.

One of the reasons for obtaining similar recognition performance with distorted and undistorted speech data is probably that most of the spectral information needed for generating phone templates is preserved in the distorted version. Although there is a change in the spectral characteristics of phonemes, as evidenced by the LPC distances, the relative spectral variations among phonemes must have been preserved even in the presence of quantization noise. We have investigated this aspect by observing the short-time smoothed spectra of different speech segments and the distances between them; one way to compute such a distance contour is sketched below.
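The following is a hedged sketch (not the report's code) of a frame-wise LPC distance contour between two aligned signals, in the spirit of Figs. 4 and 5, using the Itakura distance of Ref. 3. The 10 msec frames at 10 kHz follow Section I; the 12th-order LPC analysis and the function names are assumptions.

    import numpy as np
    from scipy.linalg import solve_toeplitz, toeplitz

    def autocorr(frame, order):
        c = np.correlate(frame, frame, mode="full")
        r = c[len(frame) - 1 : len(frame) + order]   # lags 0..order
        r[0] = r[0] * (1.0 + 1e-9) + 1e-12           # tiny floor for a stable solve
        return r

    def lpc_poly(frame, order):
        r = autocorr(frame, order)
        a = solve_toeplitz(r[:-1], r[1:])            # Levinson-type solve of R a = r
        return np.concatenate(([1.0], -a)), r        # prediction polynomial A(z)

    def itakura(ref_frame, test_frame, order=12):
        a_ref, _ = lpc_poly(ref_frame, order)
        a_tst, r = lpc_poly(test_frame, order)
        R = toeplitz(r)                              # autocorrelation matrix, test frame
        # log ratio of residual energies; zero when the two LPC fits agree
        return np.log((a_ref @ R @ a_ref) / (a_tst @ R @ a_tst))

    def distance_contour(x, y, frame_len=100, order=12):
        n = min(len(x), len(y)) // frame_len         # 100 samples = 10 msec at 10 kHz
        return [itakura(x[i * frame_len:(i + 1) * frame_len],
                        y[i * frame_len:(i + 1) * frame_len], order)
                for i in range(n)]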

Figs. 2 and 3 show spectra for two different segments of speech for the four types of ADPCM. In each case the smoothed spectrum (dotted line) of the original data segment is also shown for comparison. It is interesting to note that the spectral differences caused by the quantization noise are mainly in the low-amplitude regions of the spectrum. The significant formant information is mostly retained even for the lowest bit rate (B = 2) ADPCM speech. LPC distance [3] contours between the original and the ADPCM speech for the utterance "PLEASE HELP ME" are shown in Fig. 4. As expected, the distance between the lowest bit rate ADPCM data and the original is the largest. In order to see how well the relative spectral differences are maintained, the distance contours obtained by comparing adjacent frames are plotted in Fig. 5. It can be seen that under this one-frame shift the relative spectral variations are preserved, although the absolute distance is smaller for the distorted data.

V. CONCLUSIONS

Speech recognition performance of the Harpy system is not affected significantly by the quantization noise of ADPCM speech. This is probably due to the fact that the system uses several sources of knowledge. Moreover, the system tunes the templates for each kind of data. We have observed that, although the spectral shape is altered by ADPCM coding, the relative spectral differences among phonemes are preserved, as demonstrated in the LPC distance contours. However, if the system is to be designed for higher accuracy or for a larger vocabulary, then the finer details of the acoustic data may be needed to realize the desired objective. In such a case the performance with distorted speech input may not achieve a comparable recognition performance.

REFERENCES

1. N. S. Jayant, "Digital coding of speech waveforms: PCM, DPCM, and DM quantizers," Proc. IEEE, vol. 62, May 1974, pp. 611-632.

2. B. T. Lowerre, "The Harpy Speech Recognition System," Ph.D. Dissertation, Dept. of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, 1976.

3. F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-23, February 1975, pp. 67-72.

4. H. Goldberg, D. R. Reddy, and G. Gill, "The ZAPDASH parameters, feature extraction, segmentation, and labelling for speech understanding systems," Carnegie-Mellon University, 1977.

5. M. R. Sambur and N. S. Jayant, "LPC analysis/synthesis from speech inputs containing quantization noise or additive white noise," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-24, December 1976, pp. 488-494.

Fig. 1. Generation of ADPCM data (block diagram: speech samples feed an N-sample buffer and step-size computer driving a B-bit quantizer, whose output goes to the encoder, with a first-order predictor in the feedback loop).

Fig. 2. LP smoothed spectra for a vowel segment of ADPCM data (frequency axis 0-4 kHz).

Fig. 3. LP smoothed spectra for an unvoiced segment of ADPCM data (frequency axis 0-4 kHz).

Fig. 4. LPC distance contours between the original and ADPCM data for the utterance "PLEASE HELP ME" (panels labeled B = 3, 4, 5; horizontal axis: frame number).