Real time speaker recognition from Internet radio


Radoslaw Weychan, Tomasz Marciniak, Agnieszka Stankiewicz, Adam Dabrowski
Poznan University of Technology, Faculty of Computing Science
Chair of Control and Systems Engineering, Division of Signal Processing and Electronic Systems
E-mail: tomasz.marciniak@put.poznan.pl

Abstract: The paper presents an analysis of speaker activity in online recordings from Internet radio. The proposed system has been developed in the Matlab environment. Our research is based on four 1-hour public debates acquired from Internet radio, in which 7 to 8 speakers (including one presenter) participated. Speaker recognition was performed on short utterances to facilitate real-time processing. The speaking time of each speaker has been calculated with the use of the Gaussian mixture model (GMM) algorithm. The influence of the MPEG layer 3 compression algorithm on mel-frequency cepstral coefficients (MFCCs) is described, and the neighborhood of the speaker models is analysed with the use of the ISOMAP algorithm.

Keywords: speaker recognition, GMM, Internet radio, ISOMAP

I. INTRODUCTION

Speaker identification is an interesting type of biometric identification, since the speech signal can be used as an authorization technique to access many services and systems, such as banks, voicemail, information services, or restricted areas. The identification process consists of analyzing a person's voice and comparing it to a set of previously registered speakers. Parameters of the input signal are compared to the reference models to select the most similar one, which yields the speaker's identity.

In this paper we present an online speaker recognition system that uses recordings of public debates from Internet radio. A related approach, described in [1], [2], is based on estimating the direction of arrival of the signal in order to identify the current speaker.
However, such a solution requires access to the recording studio and an advanced acquisition setup, available to the broadcaster only. It is also sensitive to movements of the speakers and of the acquisition devices. In [3] the authors focus on speaker segmentation techniques as a preprocessing stage for speaker identification in spoken documents.

A real-time application requires a very fast and accurate comparison between the available models, which is not an easy task. For the application to work in real time, only very short speech signals can be analysed, e.g., recordings of 1 second. Our previous research [4] confirmed the possibility of highly efficient (fast and accurate) speaker recognition from such short input signals. It also has to be noticed that in the case of autonomous embedded systems [5], not only should short sequences be processed, but the algorithms used for feature extraction and modeling should be simple enough to make real-time processing possible. Our goal is therefore to search for methods that can easily be ported to hardware implementations.

II. SPEAKER RECOGNITION FUNDAMENTALS

Speaker recognition is based on individual features extracted from the voice. Such systems work in two main phases:
1) training, in which a database of speaker models is created;
2) testing, the main step, in which the input signal is processed in order to find the best match among the database models.
In both stages the following steps are performed:
1) feature extraction, typically mel-frequency cepstral coefficients (MFCCs) [6], realized for every input signal frame as follows:
   - multiplying by a window function (typically the Hamming window),
   - computing the fast Fourier transform (FFT) of the frame,
   - mel-scaling the FFT coefficients with a bank of mel filters,
   - computing the logarithm of the resulting coefficients,
   - computing the discrete cosine transform of the values obtained in the previous step;
2) modeling, typically using Gaussian mixture models (GMMs) [7], i.e., sums of weighted Gaussian distributions describing sets of cepstral coefficients.

In the testing stage the following distances between the computed models are typically used:
- Euclidean distance,
- Mahalanobis distance [8],
- Kullback-Leibler divergence [9],
- log probability.

The described steps are presented in Fig. 1, which shows a commonly used model-based scheme for speaker recognition [4].
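The feature-extraction steps listed above can be sketched in a few lines. The following is an illustrative Python/NumPy version, not the paper's Matlab/VOICEBOX code (the paper uses the melcepst function); the triangular mel filterbank, frame length, and filter count are standard textbook choices assumed only for this example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mfcc_frame(frame, fs, n_filters=24, n_coeffs=12):
    """MFCC of one frame: window -> |FFT|^2 -> mel filterbank -> log -> DCT-II."""
    n = len(frame)
    windowed = frame * np.hamming(n)                      # Hamming window
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2         # power spectrum
    energies = mel_filterbank(n_filters, n, fs) @ spectrum
    log_e = np.log(energies + 1e-10)                      # avoid log(0)
    # DCT-II of the log energies; keep coefficients 1..n_coeffs
    # (coefficient 0 reflects overall energy and is dropped)
    k = np.arange(n_filters)
    dct = np.array([np.sum(log_e * np.cos(np.pi * j * (2 * k + 1) / (2 * n_filters)))
                    for j in range(n_coeffs + 1)])
    return dct[1:n_coeffs + 1]

# Example: one ~23 ms frame of a synthetic two-tone signal at 11025 S/s
fs = 11025
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
coeffs = mfcc_frame(frame, fs)
print(coeffs.shape)  # (12,)
```

Twelve coefficients per frame match the configuration described for melcepst later in the paper.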

Fig. 1. Automatic speaker recognition system schema

The presented project uses VOICEBOX [10], a speech processing toolbox for the Matlab environment, which includes most of the functions commonly used in speech analysis, synthesis, processing, modeling, and coding:
1) MFCC calculation: the melcepst function, with the signal frame and sampling rate as inputs; 12 coefficients are calculated for every signal frame;
2) GMM modeling: the gaussmix function, with the MFCCs and the number of Gaussians as inputs;
3) GMM evaluation and log-probability calculation: the gaussmixp function, which takes the MFCCs and the previously computed speaker models (sets of means, variances, and weights) to compare.

Figure 2 presents an illustrative distribution of the first MFCC and a fitted mixture of 16 Gaussians.

Fig. 2. Adaptation of Gaussian weighted mixtures to input data distributions

The program flow is presented in Fig. 3. The input signal is sampled 44100 times per second and acquired into a buffer equivalent to 1 second of recording. The buffer is passed through for continuous playback and at the same time copied for speaker recognition. To recognize the speaker on the fly, the input stream is downsampled 4 times to 11025 S/s (samples per second), which is above the minimum of 8000 S/s required for speech processing. This rate was chosen because of the integer downsampling factor of 4; converting to an 8000 S/s stream would require resampling by high up/down factors. In the next step the signal is cut into frames and the mel-frequency cepstral coefficients are calculated for each of them. These are then scored against the Gaussian mixture models of the previously defined speakers, and the best match is presented in the graphical user interface (GUI), updated in real time (Fig. 4).

Fig. 3. Online speaker recognition software schema

The acquisition of the MPEG layer 3 (MP3) [11] compressed audio stream is done with the Matlab DSP class. One of its objects, AudioFileReader, reads audio samples from the declared input file or stream. This object provides the step function, which repeatedly acquires the declared number of samples, processed in the background. Acquisition could also be driven by timer interrupts, but the authors chose the first method because of its legibility.

III. SYSTEM DESCRIPTION

The idea of the system is to acquire and process the sound signal in real time from an Internet stream. The presented system was developed in the Matlab environment and uses the DSP class objects described in the previous section. The software contains two main threads:
1) play audio stream;
2) process buffered stream.

The GUI of the described system is presented in Fig. 4.

Fig. 4. User interface of the system

The software allows the user to choose speaker models for comparison while the input stream is processed. The input stream can also be replaced with a previously recorded one. This option has been used to analyse the recordings described in Section IV.
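The scoring performed by gaussmixp can be illustrated with a minimal NumPy sketch: each stored speaker model is a diagonal-covariance Gaussian mixture (means, variances, weights), and the model giving the highest total log-likelihood over the frames of the 1-second buffer is reported as the current speaker. The model parameters below are synthetic stand-ins, not trained values.

```python
import numpy as np

def gmm_log_likelihood(X, means, variances, weights):
    """Total log-likelihood of frames X (n_frames x dim) under a
    diagonal-covariance Gaussian mixture (k x dim means and variances,
    k weights). Mirrors the log probability returned by gaussmixp."""
    n, d = X.shape
    diff = X[:, None, :] - means[None, :, :]                  # (n, k, d)
    log_det = np.sum(np.log(variances), axis=1)               # (k,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2) # (n, k)
    log_comp = -0.5 * (d * np.log(2 * np.pi) + log_det[None, :] + mahal)
    # weighted log-sum-exp over mixture components, per frame
    log_w = np.log(weights)[None, :]
    m = np.max(log_comp + log_w, axis=1, keepdims=True)
    log_frame = m[:, 0] + np.log(np.sum(np.exp(log_comp + log_w - m), axis=1))
    return np.sum(log_frame)

rng = np.random.default_rng(0)
# Two hypothetical speaker models: 16 Gaussians over 12 MFCCs, as in the paper
models = []
for _ in range(2):
    mu = rng.normal(size=(16, 12))
    models.append((mu, np.ones((16, 12)), np.full(16, 1.0 / 16)))

# Simulated 1-second buffer: frames drawn near the components of model 0
X = models[0][0][rng.integers(0, 16, size=100)] + 0.1 * rng.normal(size=(100, 12))
scores = [gmm_log_likelihood(X, *m) for m in models]
print(int(np.argmax(scores)))  # 0 -- model 0 is the best match
```

The log-sum-exp trick keeps the per-frame computation numerically stable even when individual component densities underflow.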

Fig. 5. Influence of MP3 compression (8 bitrates) on the distribution of the first MFCC

IV. EXPERIMENTAL RESULTS

It has to be noticed that the presented system operates on a compressed speech signal. Unlike the studies described in [12], [13], the presented system deals with MPEG layer 3, a compression approach commonly used for music and movie soundtracks. According to [14], signal transcoding decreases speaker recognition accuracy. To visualize how the MP3 algorithm influences the signal, an illustrative 5-second recording has been transcoded with 8 different bitrates, from 16 to 160 kbps, and the distribution of the first MFCC has been calculated for each. The results are presented in Fig. 5. It can be observed that, as the bitrate decreases, the MFCC distribution becomes smoother and some important features may be removed. The recordings analysed in this study use a bitrate of 128 kbps; as seen in Fig. 5, even at this bitrate the dominant MFCC value is shifted and its number of occurrences reduced.

The presented speaker identification algorithm has been described in [4], [15] together with FAR/FRR (false acceptance rate / false rejection rate) plots. The study presented in this article focuses on applying this algorithm to the analysis of public debates available on Internet radio. Four recordings, each about one hour long, of the broadcast Breakfast with the Third Programme of the Polish Radio ("Śniadanie z Trójką") have been prepared. The acquired utterances are in Polish. Each discussion involved 8 or 9 speakers: 7 to 8 politicians and one presenter (Ms. Beata Michniewicz). Every speaker had a limited time to present his or her position on the discussed topic. Our goal was to measure the speaking time of every speaker. With a manually prepared breakdown, the time needed is much greater than the length of the analysed recording; with real-time calculation, it is almost equal to the length of the program.
To prepare the database of speakers for the training stage, a 5-second recording of each speaker has been used to extract cepstral coefficients. In the testing stage, i.e., the analysis of the programs, 1-second recordings have been used. The use of such short input signals is dictated by the character of the recorded conversations, which contain many short pauses between words. These moments of silence and the background noise can cause incorrect speaker identification. Figures 6 to 9 present the analysis of the recordings, while Table I gives detailed statistics.

Fig. 6. Speakers' activity in program no. 1
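The per-second decisions described above translate into speaking-time statistics of the kind reported in Table I by simple accumulation. A minimal sketch (the speaker labels and decision stream are hypothetical, for illustration only):

```python
from collections import Counter

def speaking_time(labels, segment_sec=1.0):
    """Total speaking time per speaker from a stream of per-segment
    best-match decisions (one label per 1-second analysis window)."""
    counts = Counter(labels)
    return {spk: n * segment_sec for spk, n in counts.items()}

# Hypothetical decision stream for a short excerpt of a debate
decisions = (["presenter"] * 30 + ["speaker_1"] * 95
             + ["presenter"] * 10 + ["speaker_2"] * 65)
times = speaking_time(decisions)
print(times["speaker_1"])  # 95.0 seconds
print(times["presenter"])  # 40.0 seconds
```

Misclassified 1-second segments (e.g., during pauses) add directly to the wrong speaker's total, which is why the silence handling discussed in Section V matters for the accuracy of these statistics.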

Fig. 7. Speakers' activity in program no. 2

Fig. 8. Speakers' activity in program no. 3

Fig. 9. Speakers' activity in program no. 4

TABLE I. ANALYSIS OF SPEAKERS' ACTIVITY

Program | Mean [min:s] | Standard deviation [min:s] | Min [min:s]
1       | 6:42         | 1:11                       | 3:54
2       | 6:47         | 1:52                       | 4:53
3       | 7:36         | 1:57                       | 5:36
4       | 6:52         | 1:49                       | 5:10

In order to illustrate the distances between the speaker models and their neighbours, the ISOMAP [16], [17] algorithm has been used. This algorithm is typically used to reduce dimensionality along the geodesic space of a nonlinear data manifold. To build the neighborhood graph, MFCC models have been computed for all 15 speakers. Then the log probabilities between each pair of models have been calculated, giving a 15 x 15 probability matrix. In the next step a matrix of the same dimensions containing the Euclidean distances has been computed; this matrix is the required input to the ISOMAP algorithm. The calculation flow is presented in Fig. 10.

Fig. 10. ISOMAP neighbour graph preparation schema

Figure 11 presents the obtained neighborhood graph of the speakers participating in the debates.

Fig. 11. 2-dimensional ISOMAP space of neighborhoods

V. FUTURE WORK

According to our previous research [15], [18], the presented system can be improved with voice activity detection (VAD) algorithms, which maximize the information content of the signal and improve the overall accuracy. Because the signal compression lowers the effectiveness, other speaker models can be used in the database. Additionally, the system has to be tested against speakers that are not present in the database; to avoid invalid speaker detection, a thresholding operation can be added. The ISOMAP algorithm, proposed here to visualize the neighborhood graph, can also be used to find the closest match between the obtained models.

VI. CONCLUSION

In this paper we presented an automatic speaker recognition system for encoded Internet radio signals.
The described system was developed in the Matlab environment. To our knowledge, the Matlab community does not currently offer a comparable solution; we plan to share our application on the Matlab Central File Exchange to make it publicly available. The results obtained from the analysis of 4 hours of recordings show that speaker recognition can be done in real time and can significantly speed up the analysis of speaking time. With the MPEG layer 3 compression algorithm applied to the speech signal, even the highest available quality flattens the distribution of the MFCCs. As shown in [13], proper model selection can substantially improve the system effectiveness.
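For reference, the neighborhood analysis of Section IV (a 15 x 15 Euclidean distance matrix over the models' log-probability vectors, embedded into 2 dimensions) can be sketched as a minimal classical ISOMAP: k-nearest-neighbour graph, Floyd-Warshall geodesic distances, then classical MDS. This is an illustrative NumPy implementation under those standard assumptions, not the exact code used in the paper, and the input matrix here is randomly generated.

```python
import numpy as np

def isomap_2d(D, k=4):
    """Minimal classical ISOMAP on a precomputed distance matrix D:
    k-NN graph -> Floyd-Warshall geodesics -> classical MDS to 2-D."""
    n = D.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]     # k nearest neighbours of point i
        G[i, nn] = D[i, nn]
        G[nn, i] = D[i, nn]                # keep the graph symmetric
    for m in range(n):                     # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    if np.isinf(G).any():                  # bridge any disconnected parts
        G[np.isinf(G)] = np.max(G[np.isfinite(G)])
    # classical MDS on the squared geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:2]       # two largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

rng = np.random.default_rng(1)
# Stand-in for the paper's 15 x 15 matrix of Euclidean distances between
# the speaker models' log-probability vectors
P = rng.normal(size=(15, 15))
D = np.sqrt(np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=2))
emb = isomap_2d(D)
print(emb.shape)  # (15, 2)
```

Each row of `emb` gives the 2-D coordinates of one speaker model, so nearby points in the plot correspond to easily confusable models, as in Fig. 11.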

REFERENCES

[1] S. Araki, T. Hori, M. Fujimoto, S. Watanabe, T. Yoshioka, T. Nakatani, and A. Nakamura, "Online meeting recognizer with multichannel speaker diarization," in Conference Record of the Forty-Fourth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Nov. 2010, pp. 1697-1701.
[2] A. Plinge and G. A. Fink, "Online multi-speaker tracking using multiple microphone arrays informed by auditory scene analysis," in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), Sept. 2013, pp. 1-5.
[3] K. Park, J.-S. Park, and Y.-H. Oh, "GMM adaptation based online speaker segmentation for spoken document retrieval," IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 1123-1129, 2010.
[4] T. Marciniak, R. Weychan, A. Dabrowski, and A. Krzykowska, "Speaker recognition based on short Polish sequences," IEEE SPA: Signal Processing Algorithms, Architectures, Arrangements, and Applications Conference Proceedings, pp. 95-98, 2010.
[5] Z. Piotrowski, J. Wojtun, and K. Kaminski, "Subscriber authentication using GMM and TMS320C6713 DSP," Przeglad Elektrotechniczny, no. 12a/2012, pp. 127-130, 2012.
[6] S. Molau, M. Pitz, R. Schluter, and H. Ney, "Computing mel-frequency cepstral coefficients on the power spectrum," in Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), vol. 1, 2001, pp. 73-76.
[7] D. Reynolds, "Gaussian mixture models," Encyclopedia of Biometrics, pp. 659-663, 2009.
[8] R. De Maesschalck, D. Jouan-Rimbaud, and D. Massart, "The Mahalanobis distance," Chemometrics and Intelligent Laboratory Systems, vol. 50, no. 1, pp. 1-18, 2000. [Online]. Available: http://www.sciencedirect.com/science/article/pii/s0169743999000477
[9] J. R. Hershey and P. A. Olsen, "Approximating the Kullback-Leibler divergence between Gaussian mixture models," in ICASSP (4), 2007, pp. 317-320.
[10] M. Brookes, VOICEBOX: Speech Processing Toolbox for MATLAB, 2005.
[11] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, and M. Dietz, "ISO/IEC MPEG-2 Advanced Audio Coding," J. Audio Eng. Soc., vol. 45, no. 10, pp. 789-814, 1997. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=10271
[12] R. Weychan, T. Marciniak, and A. Dabrowski, "Analysis of differences between MFCC after multiple GSM transcodings," Przeglad Elektrotechniczny, pp. 24-29, 2012.
[13] R. Weychan, A. Stankiewicz, T. Marciniak, and A. Dabrowski, "Improving of speaker identification from mobile telephone calls," in Multimedia Communications, Services and Security, ser. Communications in Computer and Information Science, vol. 429, 2014, pp. 254-264.
[14] A. Dabrowski, S. Drgas, and T. Marciniak, "Detection of GSM speech coding for telephone call classification and automatic speaker recognition," ICSES, pp. 415-418, 2008.
[15] T. Marciniak, R. Weychan, A. Dabrowski, and A. Krzykowska, "Influence of silence removal on speaker recognition based on short Polish sequences," IEEE SPA: Signal Processing Algorithms, Architectures, Arrangements, and Applications Conference Proceedings, pp. 159-163, 2011.
[16] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[17] G. Wen, L. Jiang, and J. Wen, "Using locally estimated geodesic distance to optimize neighborhood graph for isometric data embedding," Pattern Recognition, vol. 41, no. 7, pp. 2226-2236, 2008.
[18] T. Marciniak, R. Weychan, and A. Krzykowska, "Speaker recognition based on telephone quality short Polish sequences with removed silence," Przeglad Elektrotechniczny, pp. 42-46, 2012.

This work was partly supported by the project "Scholarship support for Ph.D. students specializing in majors strategic for the development of Wielkopolska," Sub-measure 8.2.2 of the Human Capital Operational Programme, co-financed by the European Union under the European Social Fund.
We would like to thank our students Albert Malina and Pawel Dymarkowski for preparing the basic software and the graphical user interface.