SPEECH ENHANCEMENT USING PITCH DETECTION APPROACH FOR NOISY ENVIRONMENT


RASHMI MAKHIJANI
Department of CSE, G.H.R.C.E., Near CRPF Campus, Hingna Road, Nagpur, Maharashtra, India
rashmi.makhijani2002@gmail.com

URMILA SHRAWANKAR
Department of CSE, G.H.R.C.E., Near CRPF Campus, Hingna Road, Nagpur, Maharashtra, India
urmila@ieee.org

Dr. V. M. THAKARE
Department of CSE, S.G.B. Amravati University, Amravati, Maharashtra, India

Abstract: Acoustical mismatch between the training and testing phases markedly degrades speech recognition results. This problem has limited the development of general-purpose real-world applications, because testing conditions are highly variable or even unpredictable during the training process. The background noise therefore has to be removed from the noisy speech signal to increase signal intelligibility and to reduce listener fatigue. Enhancement techniques applied as pre-processing stages remarkably improve recognition results. In this paper, a novel approach is used to enhance the perceived quality of the speech signal when the additive noise cannot be directly controlled: instead of controlling the background noise, we propose to reinforce the speech signal so that it can be heard more clearly in noisy environments. Subjective evaluation shows that the proposed method improves the perceptual quality of speech in various noisy environments. In some cases speaking may be more convenient than typing, even for rapid typists: many mathematical symbols are missing from the keyboard but can easily be spoken and recognized. The proposed system can therefore be used in an application designed for mathematical symbol recognition (especially of symbols not available on the keyboard) in schools.

Keywords: pitch; cepstrum analysis; speech recognition

1. Introduction

The perceptual quality of speech, defined as the overall quality of perception measured in terms of intelligibility, clarity, and naturalness, is seriously degraded by ambient noise. Many methods for improving the perceptual quality of speech in a noisy environment have been proposed and applied. Each method suggests the enhancement of perception-related speech features such as the signal-to-noise ratio (SNR), loudness, and high-band components [1]-[4]. This paper proposes a new method designed to address the unpredictable performance of conventional methods. The proposed method will be applied to recognizing mathematical symbols [5] in a school environment.

2. Methodology

Speech recognition is, in its most general form, a conversion from an acoustic waveform to a written equivalent of the message information. To process the speech signal digitally, it is necessary to make the analog waveform discrete in both time (sampling) and amplitude (quantization). The speech recognition process consists of several steps: speech acquisition, speech pre-processing, feature extraction, training, and recognition. The main focus of this paper is on the second step, speech pre-processing.

During the first step, speech acquisition, speech samples are obtained from the speaker in real time and stored in memory for pre-processing. Pre-processing is a critical process performed on the speech input in order to develop a robust and efficient system [6]. It is performed in a few stages: A/D conversion, end-point detection, pre-emphasis, and speech enhancement.

The first stage is Analog-to-Digital (A/D) conversion, where the analog speech signal is converted into a digital signal. The second stage is the removal of silent segments from the captured speech signal, otherwise known as end-point detection. The two most widely used end-point detection methods are the short-time energy (STE) method and the zero-crossing rate (ZCR) method. STE is the method implemented in this project. The speech signal is divided into 0.5 ms frames whose energy is compared with the average energy of the speech signal; frames with energy below the set threshold are discarded, and the retained frames are combined to form the final speech data for further processing.

The third stage of pre-processing is pre-emphasis, which is used to enhance the high frequencies of the speech signal. There are two important reasons for doing this: (1) to enhance the specific information carried in the higher frequencies of speech, and (2) to negate the effect of the energy decrease at higher frequencies, so that the whole spectrum of the speech signal can be analyzed properly. A sketch of these pre-processing stages follows.
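The pre-processing chain described above can be illustrated with a short Python/NumPy sketch. It is not part of the original paper; the frame length, the energy threshold factor, and the pre-emphasis coefficient are assumptions chosen for illustration.

import numpy as np

def preprocess(x, frame_len=256, energy_factor=0.25, alpha=0.97):
    """STE end-point detection followed by pre-emphasis (illustrative sketch)."""
    # Split the signal into non-overlapping frames.
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Short-time energy of each frame, compared against a fraction of the
    # average frame energy (the threshold factor is an assumption).
    energy = np.sum(frames ** 2, axis=1)
    speech = frames[energy >= energy_factor * energy.mean()].ravel()
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1].
    return np.append(speech[0], speech[1:] - alpha * speech[:-1])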
After the pre-emphasis stage, speech enhancement techniques [7] based on pitch detection are used. Pitch determination is very important for many speech processing algorithms. For example, concatenative speech synthesis methods require pitch tracking on the desired speech segments if prosody modification is to be done. Pitch is also crucial for prosodic variation in text-to-speech systems and spoken language systems. In this paper, a pitch detection method based on the cepstrum is used.

3. Pitch Detection via Cepstral Method

Cepstral analysis provides a way to estimate pitch. Assume that a sequence of voiced speech s[n] is the result of convolving the glottal excitation sequence e[n] with the vocal tract's discrete impulse response θ[n]. In the frequency domain, the convolution becomes a multiplication:

    S(ω) = E(ω) Θ(ω)    (1)

Using the property of the log function, log AB = log A + log B, this multiplicative relationship can be transformed into an additive one, and the real cepstrum of the signal is obtained as the inverse Fourier transform of the logarithmic magnitude spectrum:

    c[n] = F⁻¹{ log |S(ω)| } = F⁻¹{ log |E(ω)| } + F⁻¹{ log |Θ(ω)| }    (2)

That is, the cepstrum is a Fourier analysis of the logarithmic amplitude spectrum of the signal. If the log amplitude spectrum contains many regularly spaced harmonics, the Fourier analysis of the spectrum will show a peak corresponding to the spacing between the harmonics, i.e. the fundamental frequency. Effectively, we are treating the signal spectrum as another signal and looking for periodicity in the spectrum itself. The cepstrum is so called because it turns the spectrum inside out: the x-axis of the cepstrum has units of quefrency, and peaks in the cepstrum (which relate to periodicities in the spectrum) are called rahmonics. To obtain an estimate of the fundamental frequency from the cepstrum, we look for a peak in the quefrency region corresponding to typical speech fundamental frequencies (f0 = 1/quefrency); a minimal computation of the real cepstrum is sketched below.
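As a direct illustration of Eq. (2), the following sketch computes the real cepstrum of one signal frame. The small floor inside the logarithm is a numerical guard added here, not something from the paper.

import numpy as np

def real_cepstrum(frame):
    """Real cepstrum per Eq. (2): inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # floor avoids log(0)
    return np.fft.ifft(log_mag).real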

Fig. 1 Flow graph of cepstrum analysis

Cepstral analysis separates the effects of the vocal source and the vocal tract filter: a speech signal can be modeled as the convolution of the source excitation and the vocal tract filter, and cepstral analysis performs a deconvolution of these two components. The high-time portion of the cepstrum contains a peak at the pitch period.

Figure 1 shows a flow diagram of the cepstral pitch detection algorithm. The cepstrum of each Hamming-windowed block is computed. The peak cepstral value and its location are determined in the quefrency range corresponding to fundamental frequencies of 60 to 500 Hz, as defined in the autocorrelation algorithm. If the value of this peak exceeds a fixed threshold, the section is classified as voiced and the pitch period is the location of the peak. If the peak does not exceed the threshold, a zero-crossing count is made on the block; if the zero-crossing count exceeds a given threshold, the window is classified as unvoiced. Unlike the autocorrelation pitch detection algorithm, which uses a low-passed speech signal, cepstral pitch detection uses the full-band speech signal. Figure 2 shows the result of applying the cepstrum method of pitch detection to a sample wav file; a sketch of the decision logic follows the figure.

Fig. 2 Result of cepstrum analysis on a speech signal
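The decision logic described above can be sketched as follows. The peak and zero-crossing thresholds are illustrative assumptions, since the paper does not state the values it uses, and the handling of blocks that fail both tests is likewise an assumption.

import numpy as np

def cepstral_pitch(block, fs, f_lo=60.0, f_hi=500.0,
                   peak_thresh=0.1, zcr_thresh=0.3):
    """Voiced/unvoiced decision and pitch estimate for one block (sketch)."""
    windowed = block * np.hamming(len(block))
    # Real cepstrum of the block, as in Eq. (2).
    ceps = np.fft.ifft(np.log(np.abs(np.fft.fft(windowed)) + 1e-12)).real
    # Quefrency range corresponding to 60-500 Hz pitch.
    q_lo, q_hi = int(fs / f_hi), min(int(fs / f_lo), len(ceps) // 2)
    peak_q = q_lo + int(np.argmax(ceps[q_lo:q_hi]))
    if ceps[peak_q] > peak_thresh:
        return 'voiced', fs / peak_q          # pitch in Hz
    # Peak below threshold: fall back on a zero-crossing count.
    zcr = np.count_nonzero(np.diff(np.signbit(block).astype(np.int8)))
    if zcr > zcr_thresh * len(block):
        return 'unvoiced', None
    return 'silence', None  # remaining case is not specified in the paper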

4. Voice Feature Extraction

Voice feature extraction, otherwise known as front-end processing, is performed in both the recognition and training modes. Feature extraction converts the digital speech signal into sets of numerical descriptors, called feature vectors, that contain the key characteristics of the speech signal. Evaluating the different types of features extracted from voice to determine their suitability for recognition is the third step of this paper. The most popular and widely known features in current use are Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Mel-Frequency Cepstral Coefficients (MFCC) [8]. In this paper, MFCC is used for feature extraction.

4.1. Mel-Frequency Cepstral Coefficients (MFCC)

The Mel-frequency cepstral coefficient is one of the most prevalent and popular methods in the field of voice feature extraction. The difference between the MFC and ordinary cepstral analysis is that the MFC maps frequency components using a Mel scale, modeled on the human ear's perception of sound, instead of a linear scale [9]. The Mel-frequency cepstrum represents the short-term power spectrum of a sound using a linear cosine transform of the log power spectrum on a Mel scale. The formula for the Mel scale is

    mel(f) = 2595 log₁₀(1 + f / 700)    (3)

Fig. 3 Mel scale plot (Mel frequency in mels versus linear frequency in Hz)
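Eq. (3) is easy to state in code; the following sketch shows the mapping and its inverse (the inverse form is added here because it is needed below to place the filter-bank edges).

import numpy as np

def hz_to_mel(f):
    # Eq. (3): mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of Eq. (3).
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # ~1000: 1000 Hz maps to about 1000 mels, as in Fig. 3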

Vergin [10] noted that MFCCs, as frequency-domain parameters, are much more consistent and accurate than time-domain features, and listed the steps leading to the extraction of MFCCs: fast Fourier transform, filtering, and cosine transform of the log energy vector. According to Vergin et al. [11], MFCCs can be obtained by mapping an acoustic frequency to a perceptual frequency scale called the Mel scale. MFCCs are computed by taking a windowed frame of the speech signal, putting it through a fast Fourier transform (FFT), and finally applying Mel-scale warping to retrieve feature vectors that represent logarithmically compressed amplitude and simplified frequency information. Seddik [12] mentioned that MFCCs are computed by applying the discrete cosine transform to the log of the Mel filter-bank outputs; the results are features that describe the spectral shape of the signal. Hasan et al. [9] describe the main steps of MFCC extraction, shown in Figure 4: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel-frequency warping, filter bank, logarithm, and discrete cosine transform (DCT).

Speech input → A/D → Pre-emphasis → Framing/Windowing → Fourier transform → Mel-frequency warping → Logarithm → Discrete cosine transform → Mel-frequency cepstral coefficients

Figure 4 Block diagram of Mel-frequency cepstral coefficient extraction

MFCC uses banks of filters to warp the frequency spectrum onto the Mel scale, which is similar to how the human ear perceives sound. The filters of the Mel scale are linear at low frequencies but logarithmic at high frequencies, to imitate human hearing perception. For this project, the filters of the MFCC are adapted from VOICEBOX: Speech Processing Toolbox for MATLAB by Mike Brookes. In this paper, MFCCs are extracted by passing the frames of the windowed speech signal into the mfcc.m function written for this work; a compact sketch of the same pipeline is given below. Figure 5 shows the result of applying the LPCC and MFCC methods to the speech signal.

Figure 5 Result of applying the LPCC and MFCC methods of feature extraction to a speech signal

The main advantage of MFCC is its robustness towards noise and spectral estimation errors under various conditions [13]. D. A. Reynolds compared different features in a study and found that MFCC provides better performance than the others [14].
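The following is a minimal Python/NumPy sketch of the Figure 4 pipeline. It is not the paper's mfcc.m (which is adapted from VOICEBOX); the frame size, hop, FFT length, filter count, and coefficient count are assumptions chosen for illustration.

import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale of Eq. (3)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):                 # rising edge of triangle i
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):                 # falling edge of triangle i
            fbank[i, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def mfcc(x, fs, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, alpha=0.97):
    """MFCCs following the Figure 4 pipeline (all sizes are assumptions)."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])        # pre-emphasis
    window = np.hamming(frame_len)                     # framing/windowing
    fbank = mel_filterbank(n_filters, n_fft, fs)
    n_frames = 1 + (len(x) - frame_len) // hop
    feats = np.empty((n_frames, n_ceps))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2  # FFT -> power spectrum
        logmel = np.log(fbank @ power + 1e-12)          # Mel warping + logarithm
        feats[t] = dct(logmel, type=2, norm='ortho')[:n_ceps]  # DCT -> cepstra
    return feats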

5. Speaker Modeling

The next step after feature extraction is to generate pattern models for feature matching. In the training mode, speech models are built from the voice features extracted from the current speech samples; in the recognition mode, the speech model is compared with the current samples for identification or verification purposes. Three main types of modeling techniques are available: template matching, stochastic modeling, and neural networks. Various concepts have been introduced under these techniques, such as pattern matching (Dynamic Time Warping), which performs direct template matching between training and testing subjects. However, direct template matching becomes time-consuming as the number of feature vectors increases. Clustering reduces the number of feature vectors by using a codebook to represent the centres of the feature vectors (Vector Quantization). The LBG (Linde, Buzo and Gray) algorithm [15] and the k-means algorithm are among the best-known algorithms for Vector Quantization (VQ). Other proposed methods include neural networks and stochastic models based on probability distributions, such as the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM).

In this project, the training models are generated using the VQ-LBG method: the speech feature coefficients are passed into the function that generates the codebook. The rationale for choosing the VQ-LBG method is its ease of implementation and its performance, which is comparable to that of other methods. The squared Euclidean distance will be used as the speech similarity measure for testing; a sketch of the training step follows.
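Below is a minimal sketch of LBG codebook training with squared-Euclidean distortion. The codebook size, perturbation factor, and number of refinement passes are assumptions (the paper does not give its settings), and the size is assumed to be a power of two.

import numpy as np

def lbg_codebook(features, size=16, eps=0.01, n_iter=10):
    """Train a VQ codebook by repeated splitting plus k-means refinement."""
    codebook = features.mean(axis=0, keepdims=True)    # start with one centroid
    while len(codebook) < size:
        # Split every centroid into a perturbed pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                        # k-means refinement
            # Squared Euclidean distance of every vector to every centroid.
            d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            for j in range(len(codebook)):
                members = features[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
    return codebook

At test time, the total squared Euclidean distance of an utterance's feature vectors to their nearest codewords serves as the distortion score, and the codebook with the lowest total distortion identifies the speaker; this scoring step is standard VQ practice and consistent with the testing measure stated above.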
References

[1] B. Sauert and P. Vary, "Near end listening enhancement: Speech intelligibility improvement in noisy environments," in Proc. ICASSP, 2006, pp. 493-496.
[2] J. Shin and N. Kim, "Perceptual reinforcement of speech signal based on partial specific loudness," IEEE Signal Process. Lett., vol. 14, pp. 887-890, 2007.
[3] P. Shankar and S. Park, "Speech intelligibility enhancement using tunable equalization filter," in Proc. ICASSP, 2007, pp. 613-616.
[4] A. Keerio, B. K. Mitra, P. Birch, R. Young, and C. Chatwin, "On preprocessing of speech signals," International Journal of Signal Processing, vol. 5, no. 3, 2009, p. 216.
[5] U. Shrawankar, "Hurdles for designing an application based on speech interface," IEEE ICET 2006, NWFP University of Engineering & Technology, Peshawar, Pakistan, 13-14 Nov. 2006.
[6] Z. Fu and R. Zhao, "An overview of modeling technology of speech recognition," in Proc. 2003 Int. Conf. on Neural Networks and Signal Processing, vol. 2, pp. 887-891, 14-17 Dec. 2003.
[7] R. Drullman, J. Festen, and R. Plomp, "Effect of temporal envelope smearing on speech reception," J. Acoust. Soc. Amer., vol. 95, no. 2, pp. 1053-1064, Feb. 1994.
[8] U. Shrawankar and V. M. Thakare, "Feature extraction for a speech recognition system in a noisy environment: A study," ICCEA 2010, Indonesia, 19-21 Mar. 2010.
[9] M. R. Hasan, M. Jamil, M. G. Rabbani, and M. S. Rahman, "Speaker identification using Mel frequency cepstral coefficients," in Proc. 3rd Int. Conf. on Electrical & Computer Engineering (ICECE 2004), Dhaka, Bangladesh, 28-30 Dec. 2004.
[10] R. Vergin, "An algorithm for robust signal modelling in speech recognition," in Proc. 1998 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 969-972, 12-15 May 1998.
[11] R. Vergin, D. O'Shaughnessy, and A. Farhat, "Generalized Mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Trans. Speech Audio Process., vol. 7, no. 5, pp. 525-532, Sep. 1999.
[12] H. Seddik, A. Rahmouni, and M. Sayadi, "Text independent speaker recognition using the Mel frequency cepstral coefficients and a neural network classifier," in Proc. First Int. Symp. on Communications and Signal Processing, pp. 631-634, 2004.
[13] S. Molau et al., "Computing Mel-frequency cepstral coefficients on the power spectrum," in Proc. 2001 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '01), vol. 1, pp. 73-76, 2001.
[14] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639-643, Oct. 1994.
[15] M. Greenwood and A. Kinghorn, "SUVing: Automatic silence/unvoiced/voiced classification of speech," Department of Computer Science, The University of Sheffield, 1999.