Auditory Based Feature Vectors for Speech Recognition Systems


Auditory Based Feature Vectors for Speech Recognition Systems
Dr. Waleed H. Abdulla
Electrical & Computer Engineering Department, The University of Auckland, New Zealand
[w.abdulla@auckland.ac.nz]

Outline
- Introduction
- ASR Systems and Signal Modelling
- The Human Ears
- Equivalent Rectangular Bandwidth (ERB)
- The Gammatone Filterbank (GTF)
- Speech Signal Analysis Based on the GTF
- Classification Evaluation
- Conclusions

Introduction

Automatic speech recognition (ASR) is the process of converting an incoming acoustic signal into its corresponding stream of words. ASR systems can be:
- Speaker dependent or speaker independent
- Isolated words or continuous speech
- Limited vocabulary or large vocabulary
- Restricted domain or unrestricted domain

Introduction

The general paradigm of a speech recognition system comprises two main parts:
- Front-end: the signal processing part
- Back-end: the statistical modelling part

Block Diagram of an ASR System

An ASR system comprises a speech dataset, a training phase, and a recognition phase. In the training phase, feature vectors O_t extracted from the training set are used to train the HMMs; in the recognition phase, feature vectors extracted from the incoming speech s(n) are scored against the trained models to output the recognised word string W.

Speech Signal Processing

A speech signal can be analysed through several front-end paths:
- Digital filterbank → power estimation
- Wavelets
- Fourier transform → Mel filterbank → cepstrum (MFCC)
- Fourier transform → cepstrum
- Linear prediction → PLP coding or reflection coefficients

MFCC & PLP Filterbanks

(Figure: the MFCC filterbank and the PLP filterbank.)

Signal Modelling: Feature Extraction

The speech signal s(n) is processed as follows:
- Sampling frequency: 22050 Hz
- Pre-emphasis: H(z) = 1 - 0.97 z^-1
- Framing at a 9 ms rate with a 23 ms Hanning window
- Per frame: normalised power and 12 normalised MFCCs
- Delta and delta-delta MFCCs computed over 36 and 72 ms spans
- Delta and delta-delta power computed over 72 ms
- The features are concatenated or streamed into the observation vector O_t
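As a rough illustration of the pre-emphasis and framing steps above, the following sketch uses the slide's parameters (22050 Hz, 0.97 pre-emphasis, 23 ms Hanning frames at a 9 ms shift); the function name and NumPy implementation details are my own:

```python
import numpy as np

def frame_signal(x, fs=22050, frame_ms=23, shift_ms=9, preemph=0.97):
    """Pre-emphasise with H(z) = 1 - 0.97 z^-1, then split the signal
    into overlapping Hanning-windowed frames (rows of the result)."""
    x = np.append(x[0], x[1:] - preemph * x[:-1])      # pre-emphasis filter
    frame_len = int(fs * frame_ms / 1000)              # 23 ms -> 507 samples
    shift = int(fs * shift_ms / 1000)                  # 9 ms  -> 198 samples
    n_frames = 1 + (len(x) - frame_len) // shift
    win = np.hanning(frame_len)
    return np.stack([x[i * shift : i * shift + frame_len] * win
                     for i in range(n_frames)])
```

Each row of the returned array would then be passed to the spectral analysis stage (Mel or Gammatone filterbank) to produce one observation vector.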

The Structure of the Human Ears

(Figure: anatomy of the ear.)

Human Basilar Membrane

(Figure: the basilar membrane.)

Cochlea Characteristic Frequency for Different Species

In 1961, Don Greenwood developed a mathematical function relating the characteristic frequency f_c at any location along the length of the cochlea to the distance x from the apex (Greenwood 1961):

f_c = A (10^{a x / L} - K)

where:
- A is a high-frequency control constant
- L is the cochlea length in mm
- a is the slope factor
- K is the low-frequency control constant
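The slide leaves the constants unspecified; as an illustration, the Greenwood map can be evaluated with the commonly quoted human values (A = 165.4, a = 2.1, L = 35 mm, K = 0.88 — an assumption on my part, not taken from the slides):

```python
def greenwood_fc(x, A=165.4, a=2.1, L=35.0, K=0.88):
    """Characteristic frequency (Hz) at distance x (mm) from the cochlear apex,
    f_c = A * (10^(a*x/L) - K), using commonly quoted human constants."""
    return A * (10 ** (a * x / L) - K)
```

With these values the map spans roughly 20 Hz at the apex (x = 0) to about 20.7 kHz at the base (x = 35 mm), matching the human hearing range.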

Reverse Correlation (Revcor) Technique

The revcor technique states that, for a linear system, the system parameters can be extracted by operations on stochastic input and output signals (de Boer and de Jongh 1978). The revcor function can be represented mathematically by:

g(t) = t^m e^{-t/τ} cos(ωt)

(Figure: (a) the revcor function in the time domain; (b) its amplitude spectrum.)

Critical Band and Equivalent Rectangular Bandwidth

The critical band (CB) is the bandwidth of the human auditory filter at the different characteristic frequencies positioned along the cochlea. The bandwidth of the auditory filter can be measured psycho-acoustically in masking experiments using a sine wave (single tone) as the probe and broadband noise as the masker. Experiments show that sounds can be distinguished by the ear only if they fall into different critical bands; when they fall into the same critical band they mask each other.

(Figure: the squared magnitude response |H(f)|^2 of the actual auditory filter versus frequency.)

Equivalent Rectangular Bandwidth (ERB)

The bandwidth of the actual auditory filter can be related to an equivalent rectangular filter of unit height whose bandwidth is the ERB. It passes the same power as the real filter does when subjected to a white noise input:

ERB = ∫₀^∞ |H(f)|² df

Formulae for the ERB

Various formulas have been derived for the ERB values (f_c in kHz, ERB in Hz):

Zwicker 1961:            ERB₁ = 25 + 75 (1 + 1.4 f_c²)^0.69
Glasberg and Moore 1990: ERB₂ = 24.7 (1 + 4.37 f_c)
Moore and Glasberg 1983: ERB₃ = 6.23 f_c² + 93.39 f_c + 28.52

Comparison of Different ERB Functions

(Figure: the ERB functions plotted against centre frequency.)

General Formula for the ERB

ERB = [ (f_c / Q)^m + BW_min^m ]^{1/m}

where f_c is the centre frequency, Q is the ear quality factor (the ratio between the centre frequency and its corresponding filter bandwidth), BW_min is the minimum bandwidth allowed, and m is the order.

Lyon recommended the following parameters (Slaney 1988): Q = 8, BW_min = 125 Hz, and m = 2, to produce ERB_Ly:

ERB_Ly = [ (f_c / 8)² + 125² ]^{1/2}

General Formula for the ERB

Greenwood recommended Q = 7.24, BW_min = 22.85 Hz, m = 1, to form ERB_Gr:

ERB_Gr = f_c / 7.24 + 22.85

Glasberg and Moore (Glasberg and Moore 1990) recommended Q = 9.26, BW_min = 24.7 Hz, m = 1, to get ERB_GM:

ERB_GM = f_c / 9.26 + 24.7 = 24.7 (1 + 0.00437 f_c)

ERB_GM is used in our approach as it approximates most of the other estimates.

Comparison Between Three ERB Definitions

ERB_Ly = [ (f_c / 8)² + 125² ]^{1/2}
ERB_Gr = f_c / 7.24 + 22.85
ERB_GM = 24.7 (1 + 0.00437 f_c)
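The three definitions above are easy to compare numerically (f_c in Hz; the function names are mine):

```python
import math

def erb_lyon(fc):       # Q = 8, BW_min = 125 Hz, m = 2
    return math.sqrt((fc / 8.0) ** 2 + 125.0 ** 2)

def erb_greenwood(fc):  # Q = 7.24, BW_min = 22.85 Hz, m = 1
    return fc / 7.24 + 22.85

def erb_gm(fc):         # Q = 9.26, BW_min = 24.7 Hz, m = 1
    return 24.7 * (1 + 0.00437 * fc)

# Print the ERB (Hz) predicted by each definition at a few centre frequencies
for fc in (100, 1000, 10000):
    print(fc, round(erb_lyon(fc)), round(erb_greenwood(fc)), round(erb_gm(fc)))
```

All three agree on the general trend (bandwidth growing roughly linearly with centre frequency above a floor), which is why ERB_GM can stand in for the others.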

Critical Band Number

For a given frequency, the critical-band number z represents the number of critical bands required to reach that frequency. The change in z as the frequency changes by df is:

dz = df / (Δf/Δz) = df / ERB(f)

z = ∫₀^{f_c} df / ERB(f)

For ERB(f) = 24.7 (1 + 0.00437 f), with f in Hz:

z = ∫₀^{f_c} df / [24.7 (1 + 0.00437 f)] = 9.26 ln(0.00437 f_c + 1)
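The closed-form integral above can be checked directly; at 1 kHz it gives the familiar figure of about 15.6 ERB-rate bands (the function name is mine):

```python
import math

def erb_number(fc_hz):
    """Critical-band (ERB-rate) number z for a frequency in Hz,
    from integrating 1/ERB(f) with ERB(f) = 24.7 * (1 + 0.00437 f)."""
    return (1.0 / (24.7 * 0.00437)) * math.log(1 + 0.00437 * fc_hz)
```

Note that 1 / (24.7 × 0.00437) ≈ 9.26, the same Q that reappears in the channel-count formula a few slides on.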

Gammatone Filters

The impulse response of these filters is:

h(t) = a t^{n-1} e^{-bt} cos(ωt + φ) u(t)

where a is a gain constant, n is the filter order, b sets the bandwidth, ω = 2π f_c is the centre frequency in rad/s, φ is the phase, and u(t) is the unit step.
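A minimal sketch of this impulse response, assuming order n = 4 and the bandwidth relation b = 1.019 · ERB(f_c) derived on the next slide (the function name, duration, and peak normalisation are my own choices):

```python
import numpy as np

def gammatone_ir(fc, fs=22050, n=4, dur=0.025, phase=0.0):
    """Sampled gammatone impulse response
    h(t) = t^(n-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t + phase),
    with b = 1.019 and ERB(fc) = 24.7 * (1 + 0.00437 fc); peak-normalised."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (1 + 0.00437 * fc)
    h = (t ** (n - 1)
         * np.exp(-2 * np.pi * 1.019 * erb * t)
         * np.cos(2 * np.pi * fc * t + phase))
    return h / np.max(np.abs(h))
```

The t^{n-1} factor forces a gradual onset (h(0) = 0), and the exponential decay set by b·ERB gives each channel a bandwidth matched to the auditory filter at its centre frequency.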

ERB of Gammatone Filters

ERB = ∫₀^∞ |H(f)|² df

Near the centre frequency the magnitude response is approximately:

|H(f)| ≈ ((n-1)! / 2) [ b² + 4π² (f - f_c)² ]^{-n/2}

which gives:

ERB = [ π (2n-2)! / ( 2^{2(n-1)} ((n-1)!)² ) ] b

For n = 4: ERB = 0.9817 b, i.e. b = 1.0186 ERB

Number of Channels and the Overlapping Spacing

z = ∫_{f_L}^{f_H} df / ERB(f), with ERB(f) = f/Q + BW_min

For m = 1:

z = Q ∫_{f_L}^{f_H} df / (f + QB) = Q ln[ (f_H + QB) / (f_L + QB) ], where B = BW_min

If the overlapping factor between contiguous filters is v, then the number of channels N is related to z by z = N·v, so:

N = (Q/v) ln[ (f_H + QB) / (f_L + QB) ] = (9.26/v) ln[ (f_H + 228.7) / (f_L + 228.7) ]

for Q = 9.26 and B = 24.7.

Gammatone Filterbank

For a band f_L to f_H with overlapping factor v between filters:

N = (9.26/v) ln[ (f_H + 228.7) / (f_L + 228.7) ]

For 1 ≤ n ≤ N:

f_c(n) = -228.7 + (f_H + 228.7) e^{-v n / 9.26}

ERB(n) = 24.7 (1 + 0.00437 f_c(n))
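The centre-frequency recursion above can be sketched as follows; here the overlap factor v is solved from a desired channel count N rather than chosen directly (the function name is mine):

```python
import math

def gtf_centres(f_low, f_high, n_channels, Q=9.26, bw_min=24.7):
    """Centre frequencies f_c(n) = -QB + (f_high + QB) * exp(-v*n/Q), n = 1..N,
    with v solved from N = (Q/v) * ln((f_high + QB) / (f_low + QB)).
    Returns frequencies in descending order, from near f_high down to f_low."""
    QB = Q * bw_min                                   # = 228.7 for these values
    v = (Q / n_channels) * math.log((f_high + QB) / (f_low + QB))
    return [-QB + (f_high + QB) * math.exp(-v * n / Q)
            for n in range(1, n_channels + 1)]
```

By construction the last channel lands exactly on f_L, and the channels are spaced uniformly on the ERB-rate scale, so each filter overlaps its neighbours by the same factor v.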

Characteristics of the GTF

(Figure: characteristics of the Gammatone filterbank.)

Gammatone Filterbank

(Figure: frequency response of a 30-channel filterbank covering the 200-11025 Hz band.)

Gammatone Filterbank

(Figure: impulse responses of a 20-filter Gammatone filterbank, amplitude versus time in samples; the filter number is shown in the lower right corner of each panel.)

Equal Loudness Contours

This graph shows that the ear is not equally sensitive to all frequencies.

(Figure: ISO recommendation R226 equal loudness contours for pure tones, and the normal threshold of hearing for persons aged 18-25 years.)

Equal Loudness Pre-emphasis Filter

The non-uniformity of loudness sensing can be compensated for by a filter with the following transfer function (ω in rad/s):

E(ω) = [ (ω² + 56.8×10⁶) ω⁴ ] / [ (ω² + 6.3×10⁶)² (ω² + 0.38×10⁹) (ω⁶ + 9.58×10²⁶) ]
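This weighting (whose constants match the equal-loudness curve used in Hermansky's PLP analysis) can be evaluated directly; only its shape matters, so no normalisation is applied here, and the function name is my own:

```python
import math

def equal_loudness(f_hz):
    """Relative equal-loudness weight E(omega) with omega = 2*pi*f in rad/s:
    E = (w^2 + 56.8e6) * w^4 / ((w^2 + 6.3e6)^2 * (w^2 + 0.38e9) * (w^6 + 9.58e26))."""
    w2 = (2.0 * math.pi * f_hz) ** 2
    w6 = w2 ** 3
    num = (w2 + 56.8e6) * w2 ** 2
    den = (w2 + 6.3e6) ** 2 * (w2 + 0.38e9) * (w6 + 9.58e26)
    return num / den
```

The weight peaks in the 2-4 kHz region where the ear is most sensitive, and rolls off toward both low and very high frequencies, mimicking the equal loudness contours on the previous slide.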

Gammatone Filterbank

(Figure: amplitude frequency responses of a 20-filter Gammatone filterbank after subjecting the filters to the equal loudness pre-emphasis filter; the channels run from filter 1 toward filter 20.)

Speech Signal GTF Frequency Analysis

(Figure: speech signal analysis of the spoken digit "nine" using 30 Gammatone filters. (a) Spectra of the speech signal frames; (b) log spectra of the speech signal frames.)

Feature Extraction Paradigms

Block diagrams of the two feature extraction paradigms. Both start with the Gammatone filterbank followed by equal loudness pre-emphasis, then diverge:

(a) Gamma-PLP: intensity-loudness power law → inverse discrete Fourier transform → auto-regressive modelling → Gamma-PLP coefficients

(b) Gamma-cepst: LOG{ } → inverse discrete cosine transform → smoothing → Gamma-cepst coefficients

Feature Evaluation Based on F-Ratio

The F-ratio is a measure of feature effectiveness: the ratio of the between-class variance B to the within-class variance W. For the i-th feature over K classes:

F_i = B_i / W_i

B_i = (1/K) Σ_{j=1}^{K} (μ_ij − μ_i)²

W_i = (1/K) Σ_{j=1}^{K} W_ij

where μ_ij is the mean of feature i in class j, μ_i is its overall mean, and W_ij is its variance within class j.
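The definition above is straightforward to compute; this sketch (function name mine) takes the per-class samples of one feature and returns its F-ratio:

```python
import numpy as np

def f_ratio(feature_by_class):
    """F-ratio of one feature: between-class variance of the class means
    divided by the mean within-class variance.
    `feature_by_class` is a list of 1-D arrays, one array per class."""
    class_means = np.array([np.mean(c) for c in feature_by_class])
    grand_mean = np.mean(class_means)
    B = np.mean((class_means - grand_mean) ** 2)   # between-class variance
    W = np.mean([np.var(c) for c in feature_by_class])  # mean within-class variance
    return B / W
```

Well-separated, tightly clustered classes give a large F-ratio; heavily overlapping classes give a value near zero, which is why the F-ratio serves as a cue to recognition performance.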

F-Ratio Based on HMMs

The HMM satisfies the F-ratio conditions:
- Features have Gaussian distributions.
- A diagonal covariance implies uncorrelated features.

For K states in each model and H models:

F_ave = (1/H) Σ_{i=1}^{H} F_i

F-Ratio Characteristics

(Figure: F-ratio of the between-states procedure for the static, delta, and delta-delta coefficients; the thick red line indicates the mean of the between-states F-ratio.)

Performance Evaluation

(Figure: classification properties based on F-ratio calculations of the different feature extraction paradigms, for the static, delta, and delta-delta coefficients.)

F-Ratio Ranking of Individual Features

Each entry gives the feature index and its F-ratio; ranks 12-31 are omitted as on the original slide.

Rank  MFCC        GTCC        GTPLP       PLP
1     2 / 4.46    3 / 5.19    2 / 4.80    2 / 4.65
2     1 / 2.59    2 / 4.68    1 / 3.78    1 / 3.92
3     15 / 2.59   1 / 3.84    3 / 3.62    14 / 2.67
4     6 / 2.12    4 / 2.86    14 / 2.58   6 / 2.50
5     14 / 1.75   15 / 2.63   15 / 2.22   15 / 2.40
6     4 / 1.72    16 / 2.21   6 / 2.17    4 / 1.96
7     5 / 1.53    14 / 2.20   16 / 2.12   3 / 1.87
8     19 / 1.45   7 / 1.92    4 / 1.79    19 / 1.66
9     17 / 1.27   17 / 1.68   5 / 1.65    5 / 1.64
10    28 / 1.26   5 / 1.61    19 / 1.40   17 / 1.28
11    3 / 1.11    6 / 1.45    7 / 1.34    16 / 1.28
...
32    12 / 0.23   11 / 0.23   34 / 0.26   25 / 0.25
33    36 / 0.22   13 / 0.21   37 / 0.21   23 / 0.23
34    13 / 0.17   38 / 0.19   36 / 0.21   37 / 0.20
35    37 / 0.16   35 / 0.19   35 / 0.18   38 / 0.16
36    25 / 0.15   26 / 0.18   12 / 0.18   13 / 0.15
37    26 / 0.14   36 / 0.17   25 / 0.17   36 / 0.15
38    38 / 0.10   37 / 0.08   39 / 0.15   26 / 0.14
39    39 / 0.08   39 / 0.07   38 / 0.12   39 / 0.08

Mean F-ratio: MFCC 0.8839, GTCC 1.1471, GTPLP 1.0753, PLP 0.9862

States of the Word "three"

(Figure: the states of the word "three" as detected by its four static-feature-based CDHMMs. (a) Model MFCC13, constructed from 13 static mel scale coefficients. (b) Model PLP13, constructed from 13 static perceptual linear prediction coefficients. (c) Model GTCC13, constructed from 13 static Gammatone cepstral coefficients. (d) Model GTPLP13, constructed from 13 static Gammatone PLP coefficients. (e) The spectrogram of the input signal, to envisage the frequency content of each state.)

Classification Performance

Absolute-threshold recognition margin between the word "zero" and all other spoken words:

MFCC: 21.51
PLP: 21.78
GTPLP: 25.14
GTCC: 28.95

Recognition Rate Performance

Feature set    DATASET-I    DATASET-II
Mel-cepst      100          95.2
PLP            100          96.1
Gamma-PLP      100          97.8
Gamma-cepst    100          98.9

DATASET-I: 10 digits. DATASET-II: 31 words. S/N ratio = 20 dB.

Conclusions

- An efficient auditory-motivated technique is introduced, based mainly on the Gammatone filterbank (GTF).
- The GTF is composed of non-uniform bandpass filters imitating the frequency resolution of the cochlea.
- Two paradigms, Gamma-cepst and Gamma-PLP, are investigated.
- Classification performance based on the F-ratio figure of merit has been investigated, as it is a strong cue to recognition performance.
- The Gamma-cepst feature set outperforms the other feature sets.
