Power Normalized Cepstral Coefficient for Speaker Diarization and Acoustic Echo Cancellation


Sherbin Kanattil Kassim, P.G. Scholar, Department of ECE, Engineering College, Edathala, Ernakulam, India. sherbin_kassim@yahoo.co.in

Abstract: Developing audio processing methods for extracting audio features is important, since the extracted features carry content useful for determining human behavior. While conventional research concentrates on data recorded under constrained conditions, this work deals with data recorded in completely natural and unpredictable situations. To obtain speech and environmental sound audio signals, a benchmark is needed: a pattern of integrated algorithms for sound and speech detection and classification, voiced and non-voiced segmentation, speaker segmentation, and prediction. The acoustic system can become unstable and produce loud disturbances; the solution to these problems is the elimination of the echo with an echo suppression or echo cancellation algorithm. An audio feature extraction technique based on Power Normalized Cepstral Coefficients (PNCC) and gap statistics is used for speaker segmentation (diarization) and for predicting the number of speakers [1]. A major new feature of PNCC, which is based on auditory processing, is a module that accomplishes temporal masking; this is used for speaker segmentation. Adaptive filtering based on the LMS algorithm is used to reduce unwanted echo and increase communication quality.

I. INTRODUCTION

Audio signal processing is a rapidly growing field in which intelligent devices sense and understand human social behavior. It has already attracted researchers in areas such as psychology, ambient intelligence, healthcare, and telecommunication. Audio processing is relevant for extracting audio features that help in determining a person's characteristics and behavior. This work also deals with a set of evaluation criteria and test methods for speech recognition systems [2].
Standard methods to evaluate and measure audio signal processing have different limitations. Monitoring humans directly is very expensive, may be limited to a small number of people per observer, and may suffer from inter-observer reliability issues. The presence of a large acoustic coupling between the loudspeaker and microphone produces a loud echo that makes conversation difficult. Automatic speech recognition (ASR) has made great strides with the development of digital signal processing in both software and hardware. Different types of signal features have been proposed by the sound recognition community for the task of sound description. One dimension is the steadiness or dynamicity of a feature: it may represent a value extracted from the signal at a given time, or a parameter of a model of the signal's behavior over time (e.g., mean, standard deviation, Markov model). Another dimension is the time extent of the description: some features describe only part of the object (e.g., the attack of a sound), whereas others apply to the entire signal (e.g., loudness) [3].

II. SYSTEM ARCHITECTURE

This work covers the design and software implementation of a computing platform capable of automatically extracting and analyzing audio signals. An audio feature extraction technique based on power normalized cepstral coefficients and gap statistics is used for speaker segmentation and prediction. A speaker segmentation method based on power normalized cepstral coefficients (PNCC) is used instead of linear prediction cepstral coefficients (LPCC) or mel-frequency cepstral coefficients (MFCC) [4].

A. Proposed Audio Signal Extraction Architecture: The system is designed to continuously collect audio signals in completely natural and unpredictable situations. The proposed SASE architecture works in three stages.

Stage 1: Block detection of sound and speech, and classification of both environmental and speech sounds.
Stage 2: Voiced and non-voiced speech segmentation and sound level meter calculation.

Stage 3: Individual speaker segmentation, clustering, prediction of an unknown number of speakers, noise removal, and echo cancellation.

B. Voiced Speech Segmentation: In speech analysis, the voiced/non-voiced decision is usually tied to pitch analysis, but linking the decision statistics to pitch analysis can cause unnecessary issues. Instead, a pattern recognition approach decides whether a given segment of a speech signal should be classified as voiced speech or non-voiced (unvoiced speech and silence), based on measurements made on the signal [5]. The parameters measured are the zero-crossing rate, the energy of the speech, the correlation between adjacent speech samples, the first predictor coefficient from LPC (linear predictive coding) analysis, and the energy of the prediction error. The speech segment is assigned a class by a minimum-distance rule, derived under the assumption that the measured parameters follow a multidimensional Gaussian probability density function. The means and covariances of the Gaussian distributions are determined from manually classified speech data.

1) Features and Segmentation: The features for voiced and non-voiced segmentation are computed on blocks of speech data, namely: a) the non-initial maximum of the normalized noisy autocorrelation, b) the number of correlation peaks, and c) the normalized spectral entropy. These are computed on a per-frame basis.

C. Speaker Segmentation (Diarization) Using Power Normalized Cepstral Coefficients: Speaker diarization aims to detect who spoke when in large audio segments. It is the process of partitioning an input audio stream into acoustically homogeneous segments and clustering the segments belonging to each speaker. Here a feature extraction algorithm called Power Normalized Cepstral Coefficients (PNCC), which is based on auditory processing, is used. Diarization can serve as a pre-processing step for an automatic speech transcription system: it is performed by combining speaker segmentation and clustering, speech and non-speech segments are separated out, and the speech recognizer then has to process only the audio segments containing speech.

Fig. 1. Block diagram for speaker diarization (audio input, Wiener filter, LP residual, PNCC, HMM/GMM).

PNCC processing makes use of a power-law nonlinearity that replaces the traditional log nonlinearity used in mel-frequency cepstral coefficients, a noise reduction algorithm based on asymmetric filtering that suppresses background excitation or noise, and a module that accomplishes temporal masking.

D. Predicting the Number of Unknown Speakers: To resolve the ambiguity in the clustered segments, the number of unknown speakers involved in the conversation must be found. The standard k-means algorithm and the gap statistic technique are used for this purpose. The idea is that the within-cluster sum of squares can only decrease as the number of clusters increases; after a certain point, it decreases more slowly than before. This point is called the elbow, and it determines the optimal number of clusters [6].

E. Echo Reduction: In this global period of communication, users' demand for enhanced voice quality over wireless networks has driven a new and key technology termed echo reduction, which can provide near wire-line voice quality across a wireless network. By employing echo cancellation, the quality of speech can be improved to a great extent. The aim of an adaptive filter is to compute the error signal e(n), the difference between the desired signal and the adaptive filter output. This error signal is fed back into the adaptive filter, and the filter coefficients are altered algorithmically to minimize a function of the difference, known as the cost function. If the output of the adaptive filter equals the desired signal, the error signal goes to zero or approaches zero. Double-talk is a situation where both talkers speak at the same time [7]; an important requirement for echo cancellation is handling double-talk in a natural manner that does not cause divergence. Hybrid echoes have been present since the advent of the telephone system; this type of echo results from impedance mismatches in the analog local loop. The acoustic echo, also known as multipath echo, is produced by poor voice coupling between the earpiece and microphone in handsets and hands-free devices.

F. Adaptive Echo Reduction Approach:

Fig. 2. Adaptive echo reduction system.

Echoes are a major source of annoyance in hands-free communications, where the presence of coupling from the far-end signal (loudspeaker) to the near-end signal (microphone) would result in undesired acoustic echo.
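This coupling can be illustrated numerically: convolving the far-end (loudspeaker) signal with a room impulse response yields the echo that the microphone picks up along with the near-end talker. The sketch below uses synthetic noise signals and a made-up three-tap impulse response purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical far-end signal (loudspeaker) -- stands in for real speech.
far_end = rng.standard_normal(8000)

# Toy room impulse response: a direct path plus two decaying reflections.
# A measured RIR would be far longer; this is only for illustration.
rir = np.zeros(64)
rir[0], rir[20], rir[45] = 1.0, 0.5, 0.25

# The microphone picks up the far-end signal filtered by the echo path,
# mixed with the near-end talker (modeled here as low-level noise).
echo = np.convolve(far_end, rir)[:len(far_end)]
near_end = 0.1 * rng.standard_normal(8000)
mic = near_end + echo
```

An adaptive canceller then has to estimate this unknown echo path from the microphone and far-end signals alone.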
An adaptive filter algorithmically adjusts its parameters in order to minimize a function of the difference between the desired output d(n) and its actual output y(n); this function is known as the cost function of the adaptive algorithm. The filter h(n) represents the impulse response of the acoustic environment, and w(n) represents the adaptive filter used to cancel the echo signal. The adaptive filter aims to equate its output y(n) to the desired output d(n). A basic echo canceller used to remove echo in a communication system is shown below. An inverse filtering effect together with Wiener filter noise reduction is very effective in removing the echo.
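A minimal sketch of this arrangement, assuming a toy white-noise far-end signal and a hypothetical five-tap echo path: the LMS update drives e(n) = d(n) - y(n) toward zero, and an ERLE figure is computed from d(n) and e(n). The step size and filter length are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Far-end (reference) signal and a short hypothetical echo path h(n).
x = rng.standard_normal(20000)
h = np.array([0.8, 0.0, 0.4, 0.0, 0.2])
d = np.convolve(x, h)[:len(x)]          # desired signal = echo at the microphone

# LMS: w is the adaptive estimate of the echo path, e(n) = d(n) - y(n).
mu, L = 0.01, len(h)
w = np.zeros(L)
e = np.zeros(len(x))
for n in range(L, len(x)):
    u = x[n - L + 1:n + 1][::-1]        # most recent L reference samples
    y = w @ u                           # adaptive filter output y(n)
    e[n] = d[n] - y                     # error fed back into the update
    w += 2 * mu * e[n] * u              # LMS coefficient update

# Echo Return Loss Enhancement over the last quarter of the run.
tail = slice(3 * len(x) // 4, len(x))
erle_db = 10 * np.log10(np.mean(d[tail] ** 2) / np.mean(e[tail] ** 2))
```

With these settings the adapted weights converge toward the echo path and the ERLE rises well above the 10 dB mark the paper treats as a good result.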

The echo canceller replicates the transfer function of the echo path to synthesize a replica of the echo, and then subtracts the synthesized replica from the combined echo and near-end speech signal to obtain the near-end signal. The transfer function is not known in advance, so it must be estimated: at each iteration the error signal, e(n) = d(n) - y(n), is fed back into the filter. This adaptive echo cancellation technique is combined with asymmetric noise suppression. An audio sample of 5-10 s duration is recorded, and to test the echo cancellation algorithm an echo effect is added using an MP3 audio editor. The adaptive LMS algorithm performs the echo cancellation; the desired signal is obtained by combining the filter output and the impulse response, and a Wiener noise reduction algorithm eliminates the noise present in the audio signal. Using these three signals, the echo return loss enhancement (ERLE) factor is plotted; an ERLE at or above 10 dB is considered a good result for echo reduction.

Fig. 3. Screenshot of the ERLE.

III. SYSTEM BLOCK REPRESENTATION

A. Pitch Extraction: To obtain speech and environmental sound audio signals using an in-house-built wearable device, a set of integrated algorithms was benchmarked (sound and speech detection and classification, voiced and non-voiced segmentation, sound level meter calculation, speaker prediction and segmentation, etc.). Pitch detection algorithms can be divided into methods that operate in the frequency domain, the time domain, or both. One set of pitch detection methods uses the detection and timing of time-domain features; different time-domain methods use correlation functions or difference norms to detect similarity between the waveform and time-lagged versions of itself. Another set of methods operates in the frequency domain by locating sinusoidal peaks in the frequency transform of the input audio signal. Human pitch lies in the interval 80-800 Hz: the pitch for men is usually around 150 Hz, for women about 250 Hz, and for children a bit higher, around 300 Hz; the pitch is needed to reconstruct this part of the speech signal. Commonly used detection methods are based on energy, the cepstrum, zero crossings, difference functions, and autocorrelation. Here, cepstrum-based pitch detection is used.

Fig. 4. Pitch detection model.

B. Zero Crossing Rate: In a time-domain feature detection method, the signal is usually pre-processed to accentuate some time-domain feature, and the time between occurrences of that feature is taken as the period of the audio signal. Such feature detection schemes usually do not use all of the available data [8].

C. Separation of Voiced and Unvoiced Speech Using Zero Crossing Rate and Energy: In speech analysis, the voiced/non-voiced decision is usually performed while extracting information from the speech signal. Two features are used to separate the voiced and non-voiced parts of speech: the zero crossing rate (ZCR) and the energy of the spectrum. The results are evaluated by dividing the speech sample into segments and applying the zero crossing rate and energy calculations to separate the voiced and non-voiced parts. The zero crossing rate is low for voiced parts and high for unvoiced parts, whereas the energy is high for voiced parts and low for non-voiced parts. These methods prove effective for separating voiced and non-voiced speech and estimating their durations [9].

Fig. 5. Block representation of voiced and non-voiced time calculation.

D. Power Normalized Cepstral Coefficients for Robust Speech Recognition:
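One distinguishing element of PNCC, noted earlier, is the power-law nonlinearity that replaces the log compression used in MFCC. A minimal numeric sketch (the 1/15 exponent is the value reported by Kim and Stern; the channel powers are invented for illustration):

```python
import numpy as np

# PNCC replaces MFCC's log compression with a power-law nonlinearity
# (exponent 1/15 per Kim and Stern). Both are applied here to simulated
# filterbank channel powers spanning a wide dynamic range.
powers = np.logspace(-4, 4, 9)          # hypothetical mel/gammatone channel powers

log_out = np.log(powers)                 # MFCC-style compression
pow_out = powers ** (1.0 / 15.0)         # PNCC-style power-law compression
```

The log output diverges toward minus infinity as the channel power approaches zero, while the power-law output stays small and positive, which is part of what makes the features better behaved in low-energy frames.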

Fig. 6. Structure of the PNCC feature extraction algorithm.

The development of PNCC feature extraction was motivated by a desire to obtain a set of practical features for speech recognition that are more robust with respect to acoustical variability, without loss of performance. The processing in PNCC includes a series of nonlinear time-varying operations, performed using longer-duration temporal analysis, that accomplish noise subtraction as well as a degree of robustness with respect to reverberation.

E. Speaker Diarization:

Fig. 7. Speaker diarization representation.

Speaker diarization is the process of detecting the turns in speech caused by speaker changes and clustering the speech from the same speaker together, thereby providing useful information for structuring and indexing the audio document. A recorded audio file (.wav) of 10 s duration with a sampling frequency of 8000 Hz is given as input. After reading the audio file, the audio signal is filtered using a Wiener filter to remove noise and smooth the signal. Since the linear prediction residual is used, privacy can be preserved [10]; the higher the linear prediction order, the better the privacy. The linear prediction residual, represented as PNCC, is then used as the observed sequence for an HMM whose states represent the speakers. Since the number of speakers is unknown, the data is segmented into an initial number of clusters, 10 (assuming the number of speakers is less than 10). The parameters of the HMM are initialized by uniform segmentation of the data into the 10 clusters, and the parameters of each cluster's GMM are estimated over these segments. The log-likelihood of a combined model is compared with the sum of the log-likelihoods of the original models; the pair for which the log-likelihood improvement is largest is combined, and a new HMM is built from the combined model. This process continues until there are no more states to combine.
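The merging loop just described can be sketched in simplified form. The sketch below replaces the HMM/GMM machinery with single Gaussians over made-up one-dimensional features, and the stopping threshold is an arbitrary illustrative choice, but the greedy merge-by-log-likelihood structure is the same:

```python
import numpy as np

def gauss_loglik(x):
    # Log-likelihood of samples x under a single ML-fit Gaussian.
    mu, var = x.mean(), x.var() + 1e-9
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)

# Hypothetical 1-D "feature" data: two true speakers, started as 4 clusters.
rng = np.random.default_rng(2)
clusters = [rng.normal(0.0, 1.0, 200), rng.normal(0.2, 1.0, 200),
            rng.normal(8.0, 1.0, 200), rng.normal(8.3, 1.0, 200)]

# Greedy merging: combine the pair whose merged log-likelihood loses the
# least relative to the sum of the separate models; stop when every
# candidate merge loses too much likelihood.
while len(clusters) > 1:
    best, best_gain = None, -np.inf
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            merged = np.concatenate([clusters[i], clusters[j]])
            gain = gauss_loglik(merged) - (gauss_loglik(clusters[i])
                                           + gauss_loglik(clusters[j]))
            if gain > best_gain:
                best_gain, best = gain, (i, j)
    if best_gain < -20:          # hypothetical stopping threshold
        break
    i, j = best
    clusters[i] = np.concatenate([clusters[i], clusters[j]])
    del clusters[j]

n_speakers = len(clusters)
```

On this toy data the loop merges the two close cluster pairs and then stops, predicting two speakers.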
Now each state corresponds to a single speaker in the audio file, and the data associated with that state corresponds to the speech of that particular speaker.

IV. CONCLUSIONS

This project deals with the analysis of longitudinal and unpredictable audio signals. An efficient architecture has been developed to enable continuous audio sensing, together with scalable methods to gather and analyze the audio signal; it captures the changeable characteristics of longitudinal and unpredictable audio signals. Analysis of the audio data captured by the device yielded significantly high performance using the proposed architecture. The echo cancellation algorithm successfully provides a software solution to the problem of echoes in communication: the proposed method is a purely software approach without any hardware components, and the algorithm can run on any PC with MATLAB installed. The technique provides nearly perfect echo cancellation without losing the speech signal; the results obtained were convincing, and the output speech signals were highly satisfactory and validated the goals. Speaker diarization has emerged as an increasingly important and dedicated domain of speech research and has been developed in many domains. In this work, the linear prediction residual represented in PNCC is used for speaker diarization. Privacy preservation is increasingly important; since the linear prediction residual is used, privacy can be preserved, as intelligible speech cannot be reconstructed. Using this information, the physical and mental health of a person can be analyzed.

V. REFERENCES

[1] Bin Gao and Wai Lok Woo, "Wearable Audio Monitoring: Content-Based Processing Methodology and Implementation", IEEE Transactions on Human-Machine Systems, May 24, 2013.
[2] Chanwoo Kim and Richard M. Stern, "Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[3] Moore and B. R. Glasberg, "PNCC for Robust Speech Recognition", Carnegie Mellon University, Pittsburgh, PA 15213, USA.
[4] Mark S. Hawley, Stuart P. Cunningham, and Phil D. Green, "A Voice-Input Voice-Output Communication Aid for People With Severe Speech Impairment", IEEE Transactions on Neural Systems and Rehabilitation Engineering, Vol. 21, No. 1, January 2013.
[5] Junghsi Lee and Hsu-Chang Huang, "A Robust Double-Talk Detector for Acoustic Echo Cancellation", IMECS 2010.
[6] Radhika Hinaboina and D. S. Ramkiran, "Adaptive Algorithms for Acoustic Echo Cancellation in Speech Processing", IJRRAS 7 (1), April 2011.
[7] Mario Munoz-Organero and Pedro J. Munoz-Merino, "Adapting the Speed of Reproduction of Audio Content and Using Text Reinforcement for Maximizing the Learning Outcome through Mobile Phones", IEEE Transactions on Learning Technologies, Vol. 4, No. 3, July-September 2011.
[8] Sin-Horng Chen, Shaw-Hwa Hwang, and Yih-Ru Wang, "An RNN-Based Prosodic Information Synthesizer for Mandarin Text-to-Speech", IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 3, May 1998.
[9] J. Ajmera and C. Wooters, "A Robust Speaker Clustering Algorithm", in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2003, pp. 411-416.
[10] Marro et al., "A Two-Step Noise Reduction Technique", in IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, Vol. 1, pp. 289-292, May 2004.