DERIVATION OF TRAPS IN AUDITORY DOMAIN

Petr Motlíček, Doctoral Degree Programme (4)
Dept. of Computer Graphics and Multimedia, FIT, BUT
E-mail: motlicek@fit.vutbr.cz
Supervised by: Dr. Jan Černocký, Prof. Hynek Heřmanský

ABSTRACT

This contribution presents a potentially straightforward technique for extracting temporal information in the auditory domain. Even though the final phoneme accuracy is only comparable to that of the traditional approach, the technique is well suited to replace standard spectrum-based techniques used in ASR systems because of its higher flexibility and low computational cost.

1 INTRODUCTION

Most feature extraction methods used in current Automatic Speech Recognition (ASR) systems are based on the spectrum. However, spectrum-based techniques have distinct disadvantages: they are easily affected by a variety of issues such as communication channel distortions or narrow-band noise. Moreover, supplementary techniques need to be applied to deal with realistic communication environments. Many of the noise-robust techniques employ temporal-domain processing operations to increase robustness in ASR. Psychoacoustic experiments show that the peripheral auditory system in humans integrates information over much larger time spans than the duration of the frame used in traditional speech analysis; this time span is of the order of several hundred milliseconds.

An example of successful temporal-domain techniques are the dynamic cepstral coefficients. These coefficients are computed as the first- and second-order orthogonal polynomial expansions of feature time trajectories and are referred to as delta and acceleration coefficients, respectively. They represent the slope and curvature of the feature trajectories and are typically computed over 50 ms to 90 ms speech segments. Cepstral mean normalization, in which the long-term average is subtracted from the logarithmic speech spectrum, is another temporal processing technique.

Recently, many progressive temporal-domain processing algorithms have appeared in which the conventional spectral feature used in phonetic classification is replaced by a several-hundred-millisecond-long temporal vector of critical-band energies [1]. The phonetic class is defined with respect to the center of this temporal vector. The stream of these vectors is fed to a classifier that attempts to capture the appropriate temporal pattern (TRAP) and is therefore called a TRAP classifier (Fig. 1).
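As a concrete illustration of the dynamic (delta and acceleration) coefficients and the cepstral mean normalization mentioned above, the following minimal Python/NumPy sketch computes them with the standard regression formula. The window half-width M and the feature layout are illustrative assumptions, not values taken from this paper.

import numpy as np

def delta(features, M=2):
    # First-order dynamic (delta) coefficients of a feature trajectory.
    # features: (num_frames, num_coeffs); M: regression half-width in frames.
    # With a 10 ms frame shift, M = 2..4 spans roughly the 50-90 ms mentioned above.
    padded = np.pad(features, ((M, M), (0, 0)), mode="edge")
    denom = 2.0 * sum(m * m for m in range(1, M + 1))
    return np.stack([
        sum(m * (padded[t + M + m] - padded[t + M - m]) for m in range(1, M + 1)) / denom
        for t in range(features.shape[0])
    ])

def cmn(features):
    # Cepstral mean normalization: subtract the long-term average per coefficient.
    return features - features.mean(axis=0)

# Acceleration coefficients are simply deltas of the deltas:
# static = cmn(cepstra); d = delta(static); dd = delta(d)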

Figure 1: The innovative idea of employing temporal information in ASR.

In the original approach, a set of vectors representing the temporal evolution of a particular time trajectory is extracted. Critical bands are usually used as the basis of these time trajectories. In our approach we want to show that TRAPs do not have to be represented by time trajectories of spectral energy and can be derived in a different way, without applying any spectral processing operations.

Figure 2: Derivation of TRAPs in the auditory domain.

2 DERIVATION OF TEMPORAL PATTERNS

TRAPs are usually examined on the time evolution of the basic sound units used in ASR, i.e. phonemes. Traditionally, the speech signal is processed as a series of independent short-time (e.g. 10 ms) frames; each frame is transformed into the spectral domain using the Fourier transform, and logarithmic critical-band energies are derived. In our approach, TRAPs are derived entirely in the auditory domain (Fig. 2). To preserve the frequency independence of the classification, some sort of band-pass filter bank needs to be applied.
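For comparison, a minimal sketch of the spectrum-based derivation just described (framing, Fourier transform, logarithmic band energies) is given below, in Python/NumPy. The triangular mel filter bank is used here only as a convenient stand-in for critical bands, and the frame length, FFT size, and band count are assumptions of this sketch rather than values prescribed by the paper.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_bands, nfft, fs):
    # Triangular filters equally spaced on the mel scale (stand-in for critical bands).
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_bands + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_bands, nfft // 2 + 1))
    for b in range(n_bands):
        lo, mid, hi = bins[b], bins[b + 1], bins[b + 2]
        fb[b, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[b, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

def log_band_trajectories(x, fs=8000, frame_len=0.025, frame_shift=0.010, n_bands=15):
    # Frame the signal, Hamming-window each frame, and compute log band energies.
    N, hop, nfft = int(frame_len * fs), int(frame_shift * fs), 256
    win = np.hamming(N)
    frames = np.stack([x[i:i + N] * win for i in range(0, len(x) - N + 1, hop)])
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    return np.log(power @ mel_filter_bank(n_bands, nfft, fs).T + 1e-10)

# A spectrum-based TRAP for band b, centred on frame t, is then the slice
# log_band_trajectories(x)[t - 50 : t + 51, b]  (roughly 1 s at a 10 ms frame shift).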

Such an analysis filter bank is represented here by gammatone filters whose center frequencies and bandwidths match those of the critical bands. These linear-phase gammatone filters are applied to the input signal to obtain an auditory-based time-frequency parametrization, which approximates the patterns of neural firing generated by the auditory nerve and preserves the temporal information carried in speech. Gammatone filters can be implemented as FIR or IIR filters [3]. In our approach, FIR filters were used in order to obtain linear-phase filters with the same delay in each critical band. The analysis filters have a length of 2N − 1 coefficients. They were obtained by convolving a sampled gammatone impulse response g(n) of length N = 128 with its time reverse, where

g(n) = a·(nT)^(N−1) · e^(−2πb·ERB(fc)·nT) · cos(2π·fc·nT + φ).   (1)

T is the sampling period, fc is the center frequency, n is the discrete sample index, a and b are constants, and ERB(fc) is the equivalent rectangular bandwidth of an auditory filter. For speech sampled at 8 kHz, 15 FIR filters were used.

To extract the energy from each band-pass-filtered speech signal, the signal needs to be demodulated. The filtered signal is therefore multiplied by the complex exponential e^(j·2π·fc·nT), where j is the imaginary unit. Finally, a low-pass filter (LPF) is applied to preserve only the non-modulated spectral components.

Our approach differs from the traditional method [1] in the way the temporal patterns are derived from logarithmic critical-band energies. In the spectral-analysis-based technique, the speech signal is processed as a stream of frames in order to capture its non-stationary character (the speech is effectively downsampled according to the frame length and frame shift, and the frames are then used to derive the final temporal trajectories). Because the TRAPs are derived in the auditory domain, the signal remains fully sampled, so the extracted TRAPs (several hundred milliseconds long) contain many more samples than the originally derived TRAPs.

The extraction of TRAPs from the demodulated signals (each critical band is processed independently) is done in the same way as traditional framing: the signal is divided into segments of appropriate length with a constant overlap. Each segment is Hamming-windowed, the logarithm is taken, and the mean is subtracted. The resulting TRAPs thus have the same segment rate as in the original approach. The temporal evolution captured by an individual TRAP is sampled at the primary sampling frequency, which is Fs = 8 kHz in our experiments. The spectrum computed from the temporal trajectory of a critical band is referred to as the modulation spectrum. For clean speech, its components lie approximately between 1 and 20 Hz; components that vary more rapidly or more slowly are caused by non-speech artifacts and do not carry any useful information. We can therefore downsample these temporal trajectories by a ratio of at least 200 (the modified Fs will be 40 Hz) with appropriate low-pass filtering. The whole technique for the derivation of TRAPs in the auditory domain is illustrated in Fig. 3.

Figure 3: a) Input speech sentence (Fs = 8 kHz). b) Spectrum of the input signal passed through the 5th gammatone band-pass filter with fc = 531 Hz. c) Time-domain view of the filtered speech. d) Spectrum of the filtered signal after multiplication by the complex exponential. e) Impulse response of the following LPF. f) Amplitude frequency response of the following LPF. g) Filtered speech (dotted line) with demodulated energy (solid line). h) Energy extracted from the band-passed signal. i) Modulation spectrum of the band energy. j) Final TRAP: 1 s of energy after application of logarithm and mean normalization, with downsampling ratio R = 200.
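A minimal sketch of this filter design and demodulation is given below (Python with NumPy/SciPy). The gammatone order, the constants a, b and φ, the Glasberg-Moore ERB formula, and the 20 Hz low-pass cutoff are common choices assumed here for illustration; the paper does not list its exact values.

import numpy as np
from scipy.signal import firwin, lfilter

def gammatone_fir(fc, fs=8000, length=128, order=4, a=1.0, b=1.019, phi=0.0):
    # Linear-phase FIR gammatone filter of length 2*length - 1, cf. Eq. (1).
    # The sampled prototype g(n) is convolved with its time reverse, giving the
    # same delay in every critical band. A fourth-order gammatone and the
    # Glasberg-Moore ERB formula are assumptions made for this sketch.
    n = np.arange(length)
    T = 1.0 / fs
    erb = 24.7 * (4.37e-3 * fc + 1.0)                    # ERB(fc) in Hz
    g = a * (n * T) ** (order - 1) \
        * np.exp(-2.0 * np.pi * b * erb * n * T) \
        * np.cos(2.0 * np.pi * fc * n * T + phi)
    h = np.convolve(g, g[::-1])                          # length 2*length - 1
    return h / np.sqrt(np.sum(h ** 2))                   # simple energy normalization

def band_energy(x, fc, fs=8000, mod_cutoff=20.0):
    # Band-pass the signal, demodulate it with e^(j*2*pi*fc*n*T), and low-pass
    # filter so that only the slowly varying (modulation) components remain.
    y = np.convolve(x, gammatone_fir(fc, fs), mode="same")
    n = np.arange(len(y))
    demod = y * np.exp(1j * 2.0 * np.pi * fc * n / fs)   # shift the band to DC
    lpf = firwin(numtaps=255, cutoff=mod_cutoff, fs=fs)  # keep roughly 0-20 Hz
    return np.abs(lfilter(lpf, [1.0], demod))            # demodulated band energy

Applying band_energy once per center frequency for the 15 filters yields one energy trajectory per critical band, still sampled at 8 kHz.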

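Continuing the sketch, the demodulated energy of one band can then be downsampled and cut into TRAP segments as described above. The defaults below (R = 100, 0.5 s segments, 12.5 ms shift) reproduce the 40-sample TRAPs used in the experiments in Section 3; the two-stage decimation and the flooring of the hop length are implementation assumptions.

import numpy as np
from scipy.signal import decimate

def extract_traps(envelope, fs=8000, R=100, trap_len_s=0.5, shift_s=0.0125):
    # Cut one band's energy trajectory into mean-normalized log-energy TRAPs.
    # envelope: demodulated band energy, still sampled at fs.
    # Downsample by R with appropriate (anti-aliasing) low-pass filtering,
    # done here in two stages to keep the FIR decimators short.
    env = decimate(decimate(envelope, 10, ftype="fir"), R // 10, ftype="fir")
    fs_mod = fs / R                                       # e.g. 80 Hz for R = 100
    seg = int(round(trap_len_s * fs_mod))                 # samples per TRAP (40)
    hop = max(int(round(shift_s * fs_mod)), 1)            # segment shift in samples
    win = np.hamming(seg)
    traps = []
    for start in range(0, len(env) - seg + 1, hop):
        t = np.log(env[start:start + seg] * win + 1e-10)  # Hamming window, then log
        traps.append(t - t.mean())                        # mean subtraction
    return np.array(traps)                                # (num_traps, seg)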
3 EXPERIMENTAL SETUP

To gain an understanding of the information that is available in the time trajectories, we look for patterns in the temporal evolution of phonemes. A phoneme-labeled database is therefore needed for our experiments [2]. Each individual critical band is classified into phonetic classes by a multi-layer perceptron (MLP) with 3 layers. The size of the input layer is determined by the length of the TRAP; the hidden layer with sigmoid non-linearities has 300 neurons; the size of the output layer is given by the number of classes. The TIMIT database with 42 phonetic classes is used to train the individual band classifiers, and the training data is split into training and cross-validation (CV) sets. The outputs of the band classifiers are class posteriors that are Gaussianized (by applying the logarithm). Since there are 15 critical bands within the speech bandwidth, we have 15 different TRAP outputs at our disposal. We then use another MLP, called the merger, to combine the outputs obtained from each of the 15 TRAPs. The merger also consists of 3 layers: its input is the concatenated vector of posteriors of the 42 phonetic classes from each of the 15 TRAPs (42 × 15), the hidden layer contains 300 neurons, and the size of the output layer is given by the number of classes (42). The merger is usually trained on data different from that used for training the band classifiers; we used the OGI-stories corpus [1] and considered the same 42 phonetic classes as for TIMIT. For this new training data, TRAPs must therefore be generated and forward-passed through the band classifiers.
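A schematic sketch of the band-classifier and merger structure described above follows, using plain NumPy with random, untrained weights purely to make the data flow explicit. The layer sizes (40-sample TRAP input, 300 sigmoid hidden units, 42 classes, 15 bands, 42 × 15 merger input) follow the text; the softmax output, the initialization, and any training procedure are assumptions of this sketch.

import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # Three-layer MLP: input -> 300 sigmoid hidden units -> softmax posteriors.
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    z = h @ W2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()

n_bands, n_classes, n_hidden, trap_len = 15, 42, 300, 40
rng = np.random.default_rng(0)

def random_net(n_in, n_hid, n_out):
    # Untrained placeholder weights; in the experiments these would be trained
    # on TIMIT (band classifiers) and OGI-stories (merger).
    return (rng.normal(0.0, 0.1, (n_in, n_hid)), np.zeros(n_hid),
            rng.normal(0.0, 0.1, (n_hid, n_out)), np.zeros(n_out))

band_nets = [random_net(trap_len, n_hidden, n_classes) for _ in range(n_bands)]
merger_net = random_net(n_bands * n_classes, n_hidden, n_classes)

def classify(traps):
    # traps: (15, 40) array, one TRAP per critical band for one time instant.
    posteriors = [mlp_forward(traps[b], *band_nets[b]) for b in range(n_bands)]
    merger_in = np.log(np.concatenate(posteriors) + 1e-10)  # "Gaussianized" posteriors
    return mlp_forward(merger_in, *merger_net)               # final class posteriors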

The phoneme recognition accuracy of this classification in each individual critical band lies in the range of 21% to 25%. Tab. 1 shows the final phoneme recognition accuracy of the merger on the CV and training sets of the OGI-stories corpus. It corresponds to 500 ms long TRAPs and a 12.5 ms frame shift, which results in 40 samples per TRAP (downsampling ratio R = 100).

Technique      Best CV acc. [%]   Best train acc. [%]
Traditional    51.49              61.01
Our            50.68              62.53

Table 1: Performance with the TRAPs.

4 CONCLUSIONS

It has already been shown and published (and also successfully employed in a feature extraction algorithm for ASR [4]) that information extracted from temporal trajectories can largely increase ASR performance, especially when combined with classical features. However, the originally proposed technique was based on spectral analysis for the derivation of TRAPs. In this paper we gave a brief description of a different technique for extracting temporal information, carried out in the auditory domain. The final performance is comparable to the traditional approach (slightly worse on the CV subset and better on the training subset). These results also suggest that with a reasonably larger training corpus we should be able to achieve higher final performance. The proposed approach is advantageous in terms of possible modifications and low computational cost. For instance, it is effortless to change the time length of the created temporal segments without touching the frame shift (only the downsampling ratio is modified), and so on.

ACKNOWLEDGMENTS

This research has been partially supported by an industrial grant from Qualcomm, by DARPA grant N66001-00-2-8901/0006, by the Grant Agency of the Czech Republic under project No. 102/02/0124, and by the EC project Multi-modal meeting manager (M4), No. IST-2001-34485.

REFERENCES

[1] Hermansky H., Sharma S.: TRAPS - Classifiers of Temporal Patterns, Proceedings of ICSLP 98, Sydney, Australia, November 1998.
[2] Černocký J.: TRAPS in all senses, Report of post-doc research internship in ASP Group, OGI-OHSU, September 2001.
[3] Gold B., Morgan N.: Speech and Audio Signal Processing, John Wiley & Sons, Inc., New York, 1999.
[4] Jain P., Hermansky H., Kingsbury B.: Distributed Speech Recognition Using Noise-Robust MFCC and TRAPS-Estimated Manner Features, Proceedings of ICSLP 02, Denver, USA, September 2002.