Gammatone Cepstral Coefficient for Speaker Identification

Rahana Fathima 1, Raseena P E 2
1 M. Tech Student, Ilahia College of Engineering and Technology, Muvattupuzha, Kerala, India
2 Asst. Professor, Ilahia College of Engineering and Technology, Muvattupuzha, Kerala, India

Abstract: Digital processing of the speech signal and the voice recognition algorithm are essential for fast and accurate automatic voice recognition. The voice is a signal carrying a great deal of information, which makes direct analysis and synthesis of the complex voice signal difficult. Taking as a basis the Mel frequency cepstral coefficients (MFCC) used for speaker identification and audio parameterization, the Gammatone cepstral coefficients (GTCC) are a biologically inspired modification that employs Gammatone filters arranged on equivalent rectangular bandwidth (ERB) bands. A comparison is made between MFCC and GTCC for speaker identification. Their performance is evaluated using three machine learning methods: a neural network (NN), a support vector machine (SVM), and K-nearest neighbours (KNN). According to the results, classification accuracies are significantly higher when employing GTCC for speaker identification than MFCC.

Keywords: Feature extraction, Feature matching, Gammatone cepstral coefficient, Speaker identification

I. INTRODUCTION
Speaker recognition has always focused on security systems that control access to data or information. Speaker recognition is the process of automatically recognizing a speaker's voice on the basis of individual information in the voice waves. Speaker identification is the process of using the voice of a speaker to verify their identity and control access to services such as voice dialling, mobile banking, database access services, voice mail, or security control of a secured system. The recognition and classification of audio information have multiple applications [1]. One is the identification of the audio context for portable devices, which could allow the device to automatically adapt to the surrounding environment without human intervention [2]. In robotics this technology might be employed to make a robot interact with its environment even in the absence of light, and there are surveillance and security systems that make use of the audio information either by itself or in combination with video information [1].

II. PRINCIPLE OF VOICE RECOGNITION
A. Speaker Recognition Algorithms
A voice analysis is performed after taking an input from a user through a microphone. The design of the system involves manipulation of the input audio signal: at different stages, different operations are performed on the input signal, such as windowing, fast Fourier transform, GT filter bank, log function, and discrete cosine transform. Speaker recognition algorithms consist of two distinct phases: the first is the training session, while the second is referred to as the operation session or testing phase.

B. Gammatone Filter Properties
The Gammatone function models the human auditory filter response. The correlation between the impulse response of the Gammatone filter and the one measured in mammals was demonstrated in [3]. It is observed that the frequency selectivity properties of the cochlea and those psychophysically measured in human beings seem to converge, since: 1) the magnitude response of a fourth-order GT filter is very similar to the roex function [4], and 2) the filter bandwidth corresponds to a fixed distance on the basilar membrane. An nth-order GT filter can be approximated by a set of n first-order GT filters placed in cascade, which has an efficient digital implementation.

C. Gammatone Cepstral Coefficients
The Gammatone cepstral coefficient computation process is analogous to the MFCC extraction scheme. The audio signal is first windowed into short frames, usually of 10-50 ms. This process has a twofold purpose: 1) the (typically) non-stationary audio signal can be assumed to be stationary over such a short interval, thus facilitating the spectro-temporal signal analysis; and 2) the efficiency of the feature extraction process is increased [1]. Subsequently, the GT filter bank (composed of the frequency responses of the several GT filters) is applied to the signal's fast Fourier transform (FFT), emphasizing the perceptually meaningful sound signal frequencies. Indeed, the design of the GT filter bank is the object of study in this work, taking into account characteristics such as the total filter bank bandwidth, the GT filter order, the ERB model (Lyon, Greenwood, or Glasberg and Moore), and the number of filters. Finally, the log function and the discrete cosine transform (DCT) are applied to model human loudness perception and to decorrelate the logarithmically compressed filter outputs, thus yielding better energy compaction. The overall computation cost is almost equal to that of the MFCC computation [5].

Fig. 1 Block diagram describing the computation of the adapted Gammatone cepstral coefficients, where the labelled parameters are the GT filter order, the filter bank bandwidth, the equivalent rectangular bandwidth model, the number of GT filters, and the number of cepstral coefficients.

D. Feature Extraction
The extraction of the best parametric representation of acoustic signals is an important task for producing better recognition performance. The efficiency of this phase is important for the next phase, since it affects its behaviour.

Step 1: Windowing
The audio samples are first windowed (with a Hamming window) into 30 ms long frames with an overlap of 15 ms. The frequency range of analysis is set from 20 Hz (the minimum audible frequency) to the Nyquist frequency (in this work, 11 kHz). This process has a twofold purpose: 1) the (typically) non-stationary audio signal can be assumed to be stationary over such a short interval, thus facilitating the spectro-temporal signal analysis; and 2) the efficiency of the feature extraction process is increased. If the window is defined as w(n), 0 <= n <= N-1, where N is the number of samples in each frame, x(n) is the input signal and y(n) is the output signal, then the result of windowing is

y(n) = x(n) w(n)

with the Hamming window given by

w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)), 0 <= n <= N-1
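A minimal sketch of Step 1 (not the authors' Matlab code): splitting a mono signal into 30 ms frames with 15 ms overlap and applying the Hamming window above. The function name frame_and_window and the inputs x (signal) and fs (sampling rate) are assumptions for illustration.

```python
import numpy as np

def frame_and_window(x, fs, frame_ms=30, hop_ms=15):
    frame_len = int(round(fs * frame_ms / 1000))   # N samples per frame
    hop = int(round(fs * hop_ms / 1000))           # frame shift (50% overlap)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    # Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * w                               # y(n) = x(n) w(n), per frame
```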

Step 2: GT Filter Bank
The GT filter bank is composed of the frequency responses of the several GT filters. It is applied to the signal's fast Fourier transform (FFT), emphasizing the perceptually meaningful sound signal frequencies [5].

Fig. 2 Filter bank output

Step 3: Fast Fourier Transform
The FFT converts each frame of N samples from the time domain into the frequency domain. If X(w), H(w) and Y(w) are the Fourier transforms of x(t), h(t) and y(t) respectively, then

Y(w) = FFT[h(t) * x(t)] = H(w) X(w),

since convolution in the time domain corresponds to multiplication in the frequency domain.

Step 4: Discrete Cosine Transform
The log function and the discrete cosine transform (DCT) are applied to model human loudness perception and to decorrelate the logarithmically compressed filter outputs, thus yielding better energy compaction. A sketch of Steps 2-4 is given after Table I below.

III. METHODOLOGY
Voice recognition works on the premise that a person's voice exhibits characteristics that are unique to that speaker. The signal during the training and testing sessions can differ greatly due to many factors, such as changes in a person's voice over time, health condition (e.g. the speaker has a cold), speaking rate, acoustical noise, and variation in the recording environment via the microphone [6]. Table I gives detailed information about the recording and training session.

TABLE I
TRAINING REQUIREMENT
Process                   Description
Speaker                   Three female, two male
Tools                     Mono microphone, Matlab software
Environment               Laboratory
Sampling frequency, fs    8000 Hz
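The following is a hedged sketch of Steps 2-4: an approximate fourth-order Gammatone filter bank is built on the Glasberg and Moore ERB scale, applied to the frame power spectra, and followed by log compression and the DCT. The magnitude-response approximation, normalisation, and all function names here are illustrative assumptions, not the authors' implementation; the frames matrix is assumed to come from frame_and_window() in the earlier sketch.

```python
import numpy as np
from scipy.fftpack import dct

def erb_rate(f):                      # Glasberg & Moore ERB-rate scale
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def erb_rate_inv(e):                  # inverse mapping: ERB rate -> Hz
    return (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def erb_bandwidth(f):                 # equivalent rectangular bandwidth in Hz
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gtcc(frames, fs, n_filters=48, n_ceps=13, fmin=20.0, fmax=None, order=4):
    fmax = fmax or fs / 2.0
    n_fft = frames.shape[1]
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # Step 3: FFT power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Centre frequencies equally spaced on the ERB-rate scale
    fc = erb_rate_inv(np.linspace(erb_rate(fmin), erb_rate(fmax), n_filters))
    b = 1.019 * erb_bandwidth(fc)                          # per-filter bandwidth
    # Step 2: approximate GT magnitude response of each filter on the FFT bins
    fbank = (1.0 + ((freqs[None, :] - fc[:, None]) / b[:, None]) ** 2) ** (-order / 2.0)
    fbank /= fbank.sum(axis=1, keepdims=True)              # rough normalisation
    energies = spectrum @ fbank.T                          # filter bank energies
    # Step 4: log compression and DCT decorrelation
    return dct(np.log(energies + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```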

IV. EXPERIMENTAL EVALUATION
A. Audio Database
Speech samples are taken from five persons, with 50 to 60 samples per person. The length of the speech samples was experimentally set to 4 s.

TABLE II
AUDIO DATABASE
Speaker      Samples
Speaker 1    61
Speaker 2    50
Speaker 3    59
Speaker 4    50
Speaker 5    50

B. Experimental Setup
The speech samples are first windowed (with a Hamming window) into 30 ms long frames with an overlap of 15 ms, as done in [7]. The frequency range of analysis is set from 20 Hz (the minimum audible frequency) to the Nyquist frequency (in this work, 11 kHz). Subsequently, the audio samples are parameterized by means of GTCC (both the proposed adaptation and previous speech-oriented implementations [4], [8]) and other state-of-the-art features (MFCC and MPEG-7). MFCC are computed following their typical implementation [9]. With regard to the MPEG-7 parameterization, we consider the Audio Spectrum Envelope (since it was the MPEG-7 low-level descriptor attaining the best performance for non-speech audio recognition), which is converted to the decibel scale, then level-normalized with the RMS energy [9], and finally compacted with the DCT.

Rather than performing the audio classification at frame level, we consider complete audio patterns extracted after analyzing the whole 4 s sound samples at frame level. For these kinds of sounds, it is of great relevance to consider the signal time evolution (including envelope shape, periodicity, and consistency of temporal changes across frequency channels). Subsequently, the audio patterns obtained are compacted by calculating the mean feature vector over different intervals [8]. The main purpose of this process is to make the classification problem affordable without losing the interpretability of the feature space, which would be lost if, for example, principal component analysis or independent component analysis were used [7]. This requirement is especially important, since we are mainly interested in determining the rationale behind the performance of GTCC in contrast to other state-of-the-art audio features.

Regarding the classification system, three machine learning methods are used for completeness: 1) a neural network (NN), more specifically a multilayer perceptron with one hidden layer; 2) a support vector machine (SVM) with a radial basis function kernel and a one-versus-all multiclass approach; and 3) a K-nearest neighbour (KNN) classifier [8]. The audio patterns are divided into training and test data sets using a 10x10-fold cross-validation scheme to yield statistically reliable results. Within each fold, the samples used for training are different from those used for testing. In addition, the last experiment employs a 4-fold cross-validation scheme with a different setup, whose aim is to test the generalization capability of the features. The classification accuracy is computed as the averaged percentage of the testing samples correctly classified by each machine learning method. A sketch of this classification protocol, under stated assumptions, follows.
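A hedged sketch of the evaluation protocol described above, using scikit-learn stand-ins for the three classifiers (multilayer perceptron, RBF-kernel SVM wrapped in a one-versus-rest scheme, and K-nearest neighbours) with repeated 10-fold cross-validation. X is assumed to be one mean feature vector (GTCC or MFCC) per 4 s sample and y the speaker labels; the hyper-parameters shown are illustrative, not the authors' settings.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate(X, y):
    classifiers = {
        "NN":  MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000),   # one hidden layer
        "SVM": OneVsRestClassifier(SVC(kernel="rbf")),                    # RBF kernel, one-vs-all
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    # Mean accuracy (%) over the 10x10 cross-validation folds, per classifier
    return {name: 100.0 * cross_val_score(clf, X, y, cv=cv).mean()
            for name, clf in classifiers.items()}
```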
V. EXPERIMENTAL RESULTS
A. GTCC Adjustment
The first experiment is conducted in order to adjust the GTCC computation for non-speech audio classification purposes. For each parameter (i.e., total filter bank bandwidth, GT filter order, ERB model, and number of filters), the value maximizing the classification accuracy is selected. Firstly, the positive effect of enlarging the filter bank bandwidth (with extensions at both the low and high frequencies) from the typical bandwidth employed in speech is demonstrated [9]. Secondly, the fourth-, sixth-, and eighth-order GT filters show very similar behaviour; among them, fourth-order filters are selected given their lower computational cost. Thirdly, it is observed that both the Greenwood and the Glasberg and Moore ERB models attain better performance than Lyon's; between them, Glasberg and Moore is selected. Finally, N = 48 filters are chosen as a good trade-off between classification accuracy and filter bank complexity.

B. Features Comparison
In the following experiment, the proposed GTCC for speaker recognition is evaluated using the three machine learning methods: neural network, support vector machine, and K-nearest neighbour. GTCC and MFCC show comparable results when using the SVM, while GTCC yields a notably higher accuracy for both the KNN and the NN.

TABLE III
RESULTS OBTAINED WITH MFCC AND GTCC (NUMBER OF CORRECTLY CLASSIFIED SAMPLES)
            MFCC              GTCC
Speaker     SVM   NN   KNN    SVM   NN   KNN
Speaker 1   40    57   60     40    60   61
Speaker 2   41    50   48     41    50   49
Speaker 3   40    54   53     40    57   54
Speaker 4   38    48   46     50    48   48
Speaker 5   51    47   47     40    49   50

C. Performance Comparison: GTCC versus MFCC
The accuracy improvement yielded by GTCC with respect to MFCC is analysed. This improvement is calculated as the difference between the classification rates attained for each machine learning method. It should be noted that, in order to yield a fair comparison, the bandwidth of analysis (20 Hz-11 kHz), the number of filters (48), and the number of cepstral coefficients (13) were set identically for both GTCC and MFCC. The GTCC performs notably better than the MFCC. Sounds such as animal and bird calls show an important accuracy improvement. All sounds whose classification accuracy improves when using GTCC share some spectral similarities, as they present particular components in the low part of the spectrum, i.e., below 1 kHz. A short worked example of this improvement computation follows.
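As a short worked example (Python, not taken from the paper), the per-speaker correct-sample counts of Table III can be converted into overall classification rates using the per-speaker sample counts of Table II, and the GTCC-MFCC improvement of Section V.C is then the difference of those rates per classifier.

```python
import numpy as np

totals = np.array([61, 50, 59, 50, 50])          # Table II: samples per speaker
mfcc = np.array([[40, 57, 60],                    # Table III, MFCC: SVM, NN, KNN
                 [41, 50, 48],
                 [40, 54, 53],
                 [38, 48, 46],
                 [51, 47, 47]])
gtcc = np.array([[40, 60, 61],                    # Table III, GTCC: SVM, NN, KNN
                 [41, 50, 49],
                 [40, 57, 54],
                 [50, 48, 48],
                 [40, 49, 50]])

mfcc_rate = 100.0 * mfcc.sum(axis=0) / totals.sum()   # overall accuracy per classifier (%)
gtcc_rate = 100.0 * gtcc.sum(axis=0) / totals.sum()
for name, d in zip(["SVM", "NN", "KNN"], gtcc_rate - mfcc_rate):
    print(f"{name}: GTCC improvement over MFCC = {d:+.1f} percentage points")
```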

Fig. 3 GTCC accuracy improvement over MFCC (percentage points) per speaker for the SVM, NN, and KNN classifiers

VI. CONCLUSION
GTCC, borrowed from the non-speech research field, have been adapted for speaker identification. This paper has discussed a voice recognition algorithm together with three machine learning methods (neural network, support vector machine, and K-nearest neighbour) that are important for improving voice recognition performance. The technique was able to authenticate a particular speaker based on the individual information contained in the voice signal. The results show that these techniques can be used effectively for voice recognition purposes. However, there is still room for further improvement, for example by investigating the temporal properties of the audio signals, combining GTCC with other signal features, and implementing the technique in real time on portable multimedia devices.

REFERENCES
[1] S. Chu, S. Narayanan, and C.-C. J. Kuo, "Environmental sound recognition with time-frequency audio features," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1142-1158, Aug. 2009.
[2] Cheong Soo Yee and Abdul Manan Ahmad, "Malay language text-independent speaker verification using NN MLP classifier with MFCC," in Proc. 2008 Int. Conf. on Electronic Design.
[3] A. Rabaoui, M. Davy, S. Rossignol, and N. Ellouze, "Using one-class SVMs and wavelets for audio surveillance," IEEE Trans. Inf. Forensics Security, vol. 3, no. 4, Dec. 2008.
[4] J. Holdsworth, I. Nimmo-Smith, R. D. Patterson, and P. Rice, "Spiral VOS Final Report, Part A: The Auditory Filter Bank (Annex C)," APU Rep. 2341, 1988.
[5] O. Cheng, W. Abdulla, and Z. Salcic, "Performance evaluation of front-end algorithms for robust speech recognition," in Proc. ISSPA, 2005.
[6] Chunsheng Fang, "From Dynamic Time Warping (DTW) to Hidden Markov Model (HMM)," University of Cincinnati, 2009.
[7] http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html, downloaded on 3rd March 2010.
[8] W. Abdulla, "Auditory based feature vectors for speech recognition systems," in Advances in Communications and Software Technologies, N. E. Mastorakis and V. V. Kluev, Eds. Greece: WSEAS Press, 2002, pp. 231-236.
[9] H. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. New York: Wiley, 2005.