ScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking

Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 46 (2015) 122-126

International Conference on Information and Communication Technologies (ICICT 2014)

Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking

Lekshmi M S a,*, Sathidevi P S b

a,b Department of ECE, NIT Calicut, Kerala-67360, India

Abstract

Speech undergoes various acoustic interferences in a natural environment, while many applications require an effective way to separate the dominant signal from the interference. In this paper, a Short-time Fourier Transform (STFT) based unsupervised method for single channel speech separation is proposed. It uses the pitch information of the dominant and interfering speakers and generates a time frequency mask based on the pitch frequencies. Through rigorous objective and subjective evaluations, it is shown that the proposed system is capable of providing better Signal to Noise Ratio (SNR) and Perceptual Evaluation of Speech Quality (PESQ) scores compared to other related methods available in the literature.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of the International Conference on Information and Communication Technologies (ICICT 2014).

Keywords: CASA; pitch; IBM.

1. Introduction

Two major problems faced by hearing impaired persons are difficulty in understanding speech contaminated with other speech signals and difficulty in understanding fast speech. Hence, separation of the dominant speech from a mixture, followed by its amplification, will be very helpful for such persons.
Computational Auditory Scene Analysis (CASA) is an emerging field of signal processing aimed at developing computational systems that simulate the human auditory system. One of the main goals of CASA is speech segregation. There are two approaches to speech segregation: unsupervised and model based methods. In a model based method the system applies learned knowledge of the speaker, whereas in an unsupervised method the system receives only the mixture signal as input. Such systems extract features from the mixture, and these features are used as cues for segregating the speech. In this paper, separation of the dominant speech by an unsupervised method, which is well suited for hearing aid applications, is proposed. The most important cues used in this work are the pitch frequencies of the dominant and interfering speakers. Here, a computationally efficient method for pitch estimation of the interfering speakers and for separation of the dominant speech from a speech mixture using this pitch information is proposed. The method exhibits superior performance in terms of signal to noise ratio when compared with other systems available in the literature.

* Corresponding author. Tel.: 91-949-636-9684. E-mail address: lekshmims@gmail.com
1877-0509 doi:10.1016/j.procs.2015.02.002

2. System Overview

The input speech mixture is first decomposed into its time frequency representation using the STFT. The decomposed signal is then applied to the pitch determination block, which determines the pitch of the dominant and interfering speakers. It also identifies the gender of the speakers using the estimated pitch range [7]. After identifying the pitch of the interfering speaker, a binary mask is created and used for the segregation of speech in the time frequency domain. The segregated speech is then re-synthesized using the inverse STFT.

Fig 1: Basic block diagram of the proposed system (Input Mixture -> STFT -> Pitch Estimation -> Speech Segregation -> Resynthesis -> Segregated Dominant Speech)

2.1 Pitch Estimation

For the pitch estimation, an autocorrelation method [2] is adopted here. The input signal is separated into two channels, below and above 1 kHz.
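The two-channel split can be sketched in a few lines of numpy. This is a crude stand-in (an FFT brickwall split rather than the smooth band-splitting filters an actual implementation would use), and all names are illustrative:

```python
import numpy as np

def split_channels(x, fs, fc=1000.0):
    """Split x into channels below and above fc.

    Crude FFT brickwall split, standing in for proper
    low-pass / high-pass filtering of the two channels."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    low = np.fft.irfft(X * (freqs < fc), n=len(x))
    high = np.fft.irfft(X * (freqs >= fc), n=len(x))
    return low, high
```

For a mixture of a 300 Hz and a 2 kHz tone at an 8 kHz sampling rate, the low channel recovers the 300 Hz component almost exactly.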
For performing the channel separation, we have implemented filters with a 12 dB per octave slope. The generalized autocorrelation computation consists of a discrete Fourier transform (DFT), magnitude compression of the spectral representation, and an inverse transform (IDFT). The signal x2 corresponds to the summary autocorrelation function (SACF) and is obtained as

x2 = IDFT( |DFT(x)|^k )    (1)

The value of k should be 2 for obtaining the conventional autocorrelation, but experimentally k = 1.67 gives better peak values representing pitch. The autocorrelation output from each channel is summed to obtain the SACF. The peaks in the SACF curve produced at the output of the model are good indicators of potential pitch periods in the signal. The SACF is further enhanced by clipping it to its positive values; it is then upsampled by a factor of two, the upsampled signal is subtracted from the original clipped one, and the resulting signal is again clipped to its positive values. The time lag corresponding to the peak value of the enhanced SACF (ESACF) gives the pitch of the dominant speaker. Using the above pitch analysis method, a pitch value is estimated for each frame. From among these pitch frequencies, the most frequently occurring value is considered as the dominant pitch (Pd). For identifying the pitch of the interfering speaker, the pitch values are sorted according to their frequency
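The generalized autocorrelation (DFT, magnitude compression, IDFT) and the ESACF enhancement described above can be sketched as follows. Function names, the zero-order-hold upsampler, and the 60-400 Hz search range are illustrative choices, and the peak picking here is a bare argmax rather than the full multipitch model of [2]:

```python
import numpy as np

def generalized_autocorrelation(x, k=1.67):
    """IDFT(|DFT(x)|**k): k=2 gives the ordinary autocorrelation;
    k=1.67 was reported to give sharper pitch peaks."""
    n = len(x)
    spec = np.abs(np.fft.rfft(x, 2 * n)) ** k  # magnitude compression
    return np.fft.irfft(spec)[:n]

def enhance_sacf(sacf):
    """ESACF: clip to positive values, time-stretch the lag axis by two
    (zero-order hold here, standing in for proper interpolation),
    subtract, and clip again; this suppresses duplicate peaks at
    multiples of the true pitch period."""
    clipped = np.maximum(sacf, 0.0)
    stretched = np.repeat(clipped, 2)[:len(clipped)]
    return np.maximum(clipped - stretched, 0.0)

def pitch_from_esacf(esacf, fs, fmin=60.0, fmax=400.0):
    """Pitch from the largest ESACF peak inside a plausible lag range."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(esacf[lo:hi]))
    return fs / lag
```

For a synthetic harmonic signal the estimate lands on the fundamental: a five-harmonic tone at 150 Hz, sampled at 8 kHz, yields a pitch estimate close to 150 Hz.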

of occurrence in the frames. The dominant pitch Pd is compared with the subsequent frequently occurring pitch values by computing the difference between the two. The frequently occurring pitch value whose difference from Pd is more than 10 Hz is considered as the pitch of the interfering speaker (PI). After determining the pitch of the dominant and interfering speakers, the gender of the speakers is identified: if the pitch of a speaker is between 80 and 160 Hz it is considered a male speaker, and if the pitch is between 160 and 255 Hz it is considered a female speaker.

2.2 Speech segregation and re-synthesis

For segmenting the mixture signal, a binary mask is generated to eliminate the unwanted TF units. The basic idea is to eliminate the interfering pitch frequency, its nearby frequencies and its harmonics.

Fig 2: Schematic representation of the binary mask for each frame (value 1 everywhere, with notches of value 0 at PI, 2PI, 3PI, ...)

The binary mask is created in such a way as to eliminate frequencies in the range of the interfering pitch frequency and its harmonics kPI, where k represents the order of the harmonics (here k varies from -10 to 10; otherwise it is from -15 to 15). The binary mask for each frame is then multiplied with a cosine window, and the per-frame masks together form the mask for the entire TF representation. Speech segregation is done by multiplying x(j,i) with mask(j,i), where x(j,i) is the STFT of the mixture speech:

y(j,i) = x(j,i) mask(j,i)    (5)

Re-synthesis of the segregated signal is performed by the inverse STFT. In the proposed system, a 1024 point STFT with a Hamming window is implemented.

3. Results and Discussion

We have computed the SNR and PESQ to evaluate the performance of the proposed system and compared them with those of a closely related method [1]. In that method, the authors used a modulation frequency representation for pitch determination and a soft mask method for speech segregation.
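The segregation and re-synthesis of Section 2.2 can be sketched in numpy as below. This is a simplified stand-in, not the paper's exact mask: a Hann window at 50% overlap replaces the Hamming window so that plain overlap-add reconstructs the signal, and the notch half-bandwidth `half_bw` is an assumed value, since the exact mask width is not specified here.

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """Hann-windowed STFT; 50% overlap satisfies the overlap-add
    condition, so no synthesis window is needed for resynthesis."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=1024, hop=512):
    """Inverse STFT by overlap-add of the inverse-transformed frames."""
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out  # Hann windows at 50% overlap sum to ~1

def harmonic_reject_mask(fs, n_fft, p_i, n_harm=10, half_bw=20.0):
    """Binary mask: 1 everywhere except half_bw Hz around each harmonic
    k*p_i of the interfering pitch (half_bw is an assumed value)."""
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    mask = np.ones_like(freqs)
    for k in range(1, n_harm + 1):
        mask[np.abs(freqs - k * p_i) < half_bw] = 0.0
    return mask

def segregate(mix, fs, p_i, n_fft=1024, hop=512):
    """Notch out the interfering speaker's harmonics and resynthesize."""
    X = stft(mix, n_fft, hop)
    mask = harmonic_reject_mask(fs, n_fft, p_i)
    return istft(X * mask[None, :], n_fft, hop)
```

Applied to a mixture of a 150 Hz "dominant" tone and a 220 Hz "interfering" tone at 8 kHz, `segregate(mix, 8000, 220.0)` suppresses the 220 Hz component while leaving the 150 Hz component essentially intact.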
For evaluating the proposed method, we have taken recorded speech samples of male and female speakers with a sampling frequency of 8 kHz; they are mixed linearly, keeping one of them dominant. The system identified the gender of the speakers with an accuracy of 93%. Power spectral density plots of the clean signal, the signal segregated using the method in [1] and the signal segregated using the proposed method are provided in figure 3 to demonstrate the performance. The proposed method is implemented in Matlab 7.1.

3.1 SNR

We have arbitrarily taken 5 speech samples each from male-male, male-female and female-female mixtures for testing the system, and the performance is shown in table 1. The SNR is computed using equation (6), where x(n) is the clean signal and x^(n) is the separated signal:

SNR = 10 log10 [ Σn x(n)^2 / Σn (x(n) - x^(n))^2 ] dB    (6)

Table 1: SNR of segregated dominant speech

Mixture                             | SNR of mixture (dB) | SNR using Ref [1] (dB) | SNR using proposed system (dB)
Male speaker with male speaker      | -6.56               | -0.377                 | 3.36
Female speaker with female speaker  | -7.61               | -6.06                  | 2.55
Male speaker with female speaker    | -6.79               | -2.64                  | 2.96

Table 2: PESQ of segregated dominant speech

Mixture                             | PESQ of mixture | PESQ using Ref [1] | PESQ using proposed system
Male speaker with male speaker      | 1.93            | 2.17               | 2.27
Female speaker with female speaker  | 2.25            | 2.28               | 2.30
Male speaker with female speaker    | 1.84            | 2.05               | 2.27

Fig 3: Power spectral density plots of clean speech (blue), separated speech using [1] (red) and separated speech using the proposed system (black)

3.2 PESQ

The Perceptual Evaluation of Speech Quality (PESQ) is an international standard for estimating the Mean
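The SNR measure of equation (6) is a one-liner; a minimal sketch:

```python
import numpy as np

def snr_db(clean, separated):
    """Eq. (6): ratio of clean-signal energy to residual energy, in dB."""
    clean = np.asarray(clean, dtype=float)
    residual = clean - np.asarray(separated, dtype=float)
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))
```

For example, a separated signal equal to 0.9 times the clean signal leaves a residual at one tenth of the clean amplitude and therefore scores 20 dB.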

Opinion Score (MOS) from the clean speech signal and its degraded version. PESQ was officially standardized by the International Telecommunication Union. It gives a score ranging from 0 to 5.

4. Conclusion

In this paper, an unsupervised speech segregation method for the separation of the dominant speech from a speech mixture is proposed. The pitch frequencies of the dominant and interfering speakers are first determined, and binary masks are then created using this pitch information. The experimental results show that the proposed method yields better performance than the related work [1] in terms of SNR and PESQ.

References

1. A. Mahmoodzadeh, H. R. Abutalebi, H. Soltanian-Zadeh and H. Sheikhzadeh, Single channel speech separation in modulation frequency domain based on a novel pitch range estimation method, EURASIP Journal on Advances in Signal Processing, 2012.
2. T. Tolonen and M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing, November 2000.
3. Y. Hu and P. Loizou, Evaluation of objective measures for speech enhancement, Proceedings of INTERSPEECH-2006, Philadelphia, PA.
4. DeLiang Wang and Guy J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms and Applications, IEEE Press, 2006.
5. Guoning Hu and DeLiang Wang, Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Transactions on Neural Networks, September 2004.
6. DeLiang Wang, On ideal binary mask as the computational goal of auditory scene analysis, in Speech Separation by Humans and Machines, pp. 181-197, Kluwer Academic, Norwell, MA, 2005.
7. Hartmut Traunmüller and Anders Eriksson, The frequency range of the voice fundamental in the speech of male and female adults, Department of Linguistics, University of Stockholm, 1994.