End-to-End Deep Learning Framework for Speech Paralinguistics Detection Based on Perception Aware Spectrum

INTERSPEECH 2017, August 20-24, 2017, Stockholm, Sweden

End-to-End Deep Learning Framework for Speech Paralinguistics Detection Based on Perception Aware Spectrum

Danwei Cai 1,2, Zhidong Ni 1,2, Wenbo Liu 1, Weicheng Cai 1, Gang Li 3, Ming Li 1,2

1 School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
2 SYSU-CMU Shunde International Joint Research Institute, Guangdong, China
3 Jiangsu Jinling Science and Technology Group Limited, Jiangsu, China

liming46@mail.sysu.edu.cn

This research was funded in part by the National Natural Science Foundation of China, the Natural Science Foundation of Guangdong Province, the Natural Science Foundation of Guangzhou City, the Fundamental Research Funds for the Central Universities (15lgjc12), the National Key Research and Development Program and an IBM Faculty Award.

Abstract

In this paper, we propose an end-to-end deep learning framework to detect speech paralinguistics using a perception aware spectrum as input. Existing studies show that speech under cold exhibits a distinctly different energy distribution in the low-frequency components compared with speech under healthy conditions. This motivates us to use a perception aware spectrum as the input to an end-to-end learning framework on a small-scale dataset. In this work, we try both the Constant Q Transform (CQT) spectrum and the Gammatone spectrum in different end-to-end deep learning networks; both spectra closely mimic human auditory perception and turn the signal into 2-D images. Experimental results show the effectiveness of the proposed perception aware spectrum with the end-to-end deep learning approach on the Interspeech 2017 Computational Paralinguistics Cold sub-challenge. The final fusion result of our proposed method is 8% better than that of the provided baseline in terms of UAR.

Index Terms: computational paralinguistics, speech under cold, deep learning, perception aware spectrum

1. Introduction

Speech paralinguistics studies the non-verbal signals of speech, including accent, emotion, modulation, fluency and other perceptible speech phenomena beyond the pure transcriptional content of spoken speech [1]. With the advent of computational paralinguistics, such phenomena can be analysed by machine learning methods. The Interspeech Computational Paralinguistics Challenge (ComParE) is an open challenge in this field. The Interspeech 2017 ComParE Challenge addressed three new problems within computational paralinguistics: the Addressee sub-challenge, the Cold sub-challenge and the Snoring sub-challenge [2].

In this paper, we propose an efficient deep learning architecture for the Cold sub-challenge of the Interspeech 2017 Computational Paralinguistics Challenge [2]. The task aims to differentiate cold-affected speech from normal speech. The challenge baseline includes three independent systems. The first two systems use a traditional classification method (i.e. SVM) with the ComParE feature representation [3] and a bag-of-audio-words (BoAW) feature representation [4], and achieve an unweighted average recall (UAR) of 64.0% and 64.2%, respectively. The third system employs end-to-end learning but only achieves a UAR of 59.1%. Similar to [5], this system uses a convolutional network to extract features from the raw audio, and a subsequent recurrent network (i.e. LSTM) then performs the final classification [2]. During the past few years, deep learning has made significant progress.
Deep learning methods outperform traditional machine learning methods in a variety of speech applications such as speech recognition [6], language recognition [7], text-dependent speaker verification [8], emotion recognition [5] and anti-spoofing tasks. This motivates us to apply deep learning methods to computational paralinguistic tasks. However, the end-to-end baseline system provided in [2] did not achieve a better UAR than the other two baseline systems. One possible reason is that a small-scale dataset may not be able to drive a deep neural network to learn good, robust features for classification directly from the waveform. We thus look into frequency-domain representations (i.e. spectrograms) to perform the end-to-end learning. Spectrograms are a widely used audio feature representation in deep learning and carry a wealth of acoustic information.

Existing studies show that, compared with speech under healthy conditions, speech under cold has larger amplitudes in the low-frequency components and lower amplitudes in the high-frequency components [9]. Also, from the viewpoint of the human auditory system, human ears are more sensitive to small changes at low frequencies [10]. This motivates us to use perception aware spectrograms (i.e. Gammatone spectrograms and Constant Q Transform spectrograms) as the input to an end-to-end deep learning framework when performing computational paralinguistic tasks. The constant Q transform employs geometrically spaced frequency bins and ensures a constant Q factor across the entire spectrum, which results in a finer frequency resolution at low frequencies while providing a higher temporal resolution at high frequencies [11]. The Gammatone spectrum employs Gammatone filters, which are conceived as a simple fit to experimental observations of the mammalian cochlea and have a repeated pole structure leading to an impulse response that is the product of a gamma envelope $g(t) = t^{n-1} e^{-bt}$ and a sinusoid (tone) [12, 13].

To the best of our knowledge, deep learning frameworks with CQT spectrogram input have been successfully applied to piano music transcription [14], audio scene classification and domestic audio tagging [15], but the performance of a deep learning framework with Gammatone spectrogram input still remains to be investigated.

In this work, we try different network architectures with the above two perception aware spectra, and find that the perception aware spectrum outperforms the conventional short-term Fourier transform (STFT) spectrum in the paralinguistic task of detecting cold-affected speech. We believe that our proposed method is applicable to other computational paralinguistic speech tasks as well.

The remainder of this paper is organized as follows. Section 2 describes the proposed methods and the background on their major components. Section 3 presents the dataset and experimental results. A brief conclusion is given in Section 4.

2. Methods

2.1. Perception aware spectrum

2.1.1. STFT spectrograms

Traditionally, the discrete-time short-term Fourier transform is used to generate spectrograms of time-domain audio signals. The STFT is in fact a filter bank. The Q factor, defined as the ratio between the center frequency $f_k$ and the frequency bandwidth $\Delta f$, is a measure of the selectivity of each filter:

$$Q = \frac{f_k}{\Delta f} \qquad (1)$$

In the STFT, the Q factor increases with frequency since the bandwidth $\Delta f$, determined by the window function, is identical for all filters. However, human ears can easily perceive small changes at low frequencies, whereas at high frequencies only gross differences can be detected; the human auditory system is known to approximate a constant Q factor between 500 Hz and 20 kHz [10]. As a result, the STFT spectrum with its varying Q may not be well suited to speech signal analysis, whereas a perception aware spectrum can provide more discriminant information for cold-affected speech detection and other computational paralinguistic tasks.

2.1.2. CQT spectrograms

The first perception aware spectrum we try in the end-to-end deep learning framework is the constant Q transform spectrogram. The CQT was introduced by Youngberg and Boll [16] in 1978 and later refined by Brown [17]. In contrast to the fixed time-frequency resolution of Fourier methods, the CQT ensures a constant Q factor across the entire spectrum and thus gives a higher frequency resolution at low frequencies and a higher temporal resolution at high frequencies. The CQT $X(k, n)$ of a discrete-time signal $x(n)$ can be calculated as

$$X(k, n) = \sum_{j = n - \lfloor N_k/2 \rfloor}^{n + \lfloor N_k/2 \rfloor} x(j)\, a_k^{*}(j - n + N_k/2) \qquad (2)$$

where $k$ is the index of the frequency bin, $N_k$ is a variable window length and the $a_k(n)$ are complex-valued waveforms, here also called time-frequency atoms, defined as

$$a_k(n) = \frac{1}{C}\, w\!\left(\frac{n}{N_k}\right) \exp\!\left[ i \left( 2\pi n \frac{f_k}{f_s} + \Phi_k \right) \right] \qquad (3)$$

where $f_k$ is the center frequency of the corresponding frequency bin, $f_s$ is the sampling rate, $w(t)$ is a window function and $\Phi_k$ is a phase offset. $C$ is a scaling factor given by

$$C = \sum_{l = -\lfloor N_k/2 \rfloor}^{\lfloor N_k/2 \rfloor} w\!\left( \frac{l + N_k/2}{N_k} \right) \qquad (4)$$

Figure 1: Spectrograms of train_0250.wav in the URTIC dataset, computed with the short-time Fourier transform (top), the constant Q transform (middle) and Gammatone filters (bottom).

Since a bin spacing corresponding to equal temperament is desired, the center frequencies $f_k$ obey

$$f_k = f_1 \cdot 2^{\frac{k-1}{B}} \qquad (5)$$

where $f_1$ is the center frequency of the lowest-frequency bin and $B$ is a constant that determines the time-frequency resolution trade-off.
We can then write the Q factor as

$$Q = \frac{f_k}{f_{k+1} - f_k} = \left( 2^{1/B} - 1 \right)^{-1} \qquad (6)$$

Finally, the window lengths $N_k$, which are inversely proportional to $f_k$ so as to ensure a constant Q for all frequency bins, are given by

$$N_k = \frac{f_s}{f_k}\, Q \qquad (7)$$

2.1.3. Gammatone spectrograms

The second perception aware spectrum we try in the end-to-end deep learning framework is the Gammatone spectrogram. Gammatone filters are a linear approximation to the filtering performed by the ear. To obtain a Gammatone spectrum, the audio signal is first analysed with a multi-channel Gammatone filterbank [18], and the energy of each filter output is then summed within each time frame [12].

Figure 1 shows the STFT, CQT and Gammatone spectrograms for an arbitrarily selected speech signal from the Cold sub-challenge dataset. Both the CQT spectrum and the Gammatone spectrum clearly emphasize the low frequencies. The major difference between them lies in the low-frequency components: the CQT spectrum gives a good frequency resolution but a poor time resolution there, because it enforces a constant Q factor, while the Gammatone spectrum provides a smoother frequency resolution, as in the human cochlea, and a relatively good time resolution, because it applies the Gammatone filters within regular time frames. It is hard to say in advance which kind of spectrogram will be better for the cold-affected speech detection task or for other computational paralinguistic tasks. What is certain is that perception aware spectrograms, which reflect the human perception system more closely, provide more information at low frequencies and help the deep neural network learn discriminative features for classification.
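As a concrete illustration of this front-end, the following Python sketch extracts log-scaled CQT and Gammatone spectrograms from a speech file. It is only a minimal sketch, not the exact feature extraction used in our experiments: the hop sizes, bin counts and minimum frequencies are assumed values, librosa is assumed for the CQT, and the third-party gammatone package (with its gtgram function) is assumed for the Gammatone filterbank.

```python
# Minimal sketch: perception aware spectrogram extraction.
# All parameter values below are assumptions, not the paper's exact settings.
import numpy as np
import librosa                        # assumed: audio loading and CQT
from gammatone.gtgram import gtgram   # assumed third-party Gammatone filterbank

def perception_aware_spectrograms(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)

    # Constant Q transform: geometrically spaced bins, constant Q factor.
    cqt = librosa.cqt(y, sr=sr, hop_length=256, fmin=32.7,
                      n_bins=84, bins_per_octave=12)
    log_cqt = librosa.amplitude_to_db(np.abs(cqt))

    # Gammatone spectrogram: per-channel filterbank energy summed per frame.
    gt = gtgram(y, sr, window_time=0.025, hop_time=0.010,
                channels=64, f_min=50)
    log_gt = np.log(gt + 1e-10)

    return log_cqt, log_gt   # 2-D "images": frequency bins x time frames
```

Each output is a 2-D array that can be treated as an image and fed to the networks described next.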

2.2. End-to-end deep learning framework

To perform end-to-end learning for the cold-affected speech detection task, we combine a convolutional neural network (CNN) and a recurrent neural network (RNN) to learn features automatically. The general combination scheme is as follows. First, a convolutional neural network acts as the feature extractor on the input perception aware spectrum. Then, the CNN's output is fed into a recurrent neural network. The output of the CNN is a set of channels (i.e. feature maps). In our network, the 3-D tensor output of the CNN is interpreted as a sequence of 2-D tensors along the time axis, where each 2-D tensor contains the information from every channel. We employ a single gated recurrent unit (GRU) layer on these 2-D slices, which enables the information from different channels to mix inside the GRU. Finally, a fully connected layer with a softmax operates on the RNN's output to perform the classification. Figure 2 illustrates our end-to-end network architecture.

Figure 2: The end-to-end network architecture with perception aware spectrogram input. The deep learning network consists of 4 convolution layers, 1 GRU layer and 1 fully connected layer.

A CNN-LSTM deep learning framework has been successfully applied to the paralinguistic task of detecting spontaneous or natural emotions in speech, except that that work uses a raw time-domain input [5]. This framework, with some residual connections ("shortcuts") from the input to the RNN and from the CNN to the fully connected layers, has also been used in speech recognition [19].
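The sketch below is one possible PyTorch realization of the architecture in Figure 2 (four convolution layers, one GRU layer with 500 hidden units, one fully connected classification layer). The channel counts, kernel sizes and input size are our assumptions, since the exact values are not fully specified in Table 1; it is meant as an illustration rather than the implementation used in our experiments.

```python
# Sketch of the CNN + GRU + FC architecture. Channel counts and kernel sizes
# are assumptions; the paper's exact values are not reproduced here.
import torch
import torch.nn as nn

class CnnGruFc(nn.Module):
    def __init__(self, n_freq_bins=84, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=(2, 1), padding=1),
        )
        pooled_bins = (n_freq_bins + 1) // 2          # frequency bins after pooling
        self.gru = nn.GRU(input_size=64 * pooled_bins, hidden_size=500,
                          batch_first=True)
        self.fc = nn.Linear(500, n_classes)

    def forward(self, spec):                          # spec: (batch, 1, freq, time)
        x = self.cnn(spec)                            # (batch, 64, freq', time)
        x = x.permute(0, 3, 1, 2).flatten(2)          # (batch, time, 64 * freq')
        _, h = self.gru(x)                            # h: (1, batch, 500)
        return self.fc(h[-1])                         # logits: (batch, n_classes)
```

A batch of spectrogram "images" of shape (batch, 1, 84, frames) then yields two class logits (cold-affected versus healthy); replacing the GRU and classification layer with two fully connected layers gives the CNN+FC variant listed later in Table 1.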
2.3. CQCC and MFCC in a GMM framework

To verify the effectiveness of the end-to-end deep learning network on the perception aware spectrum, we also use CQCC and MFCC as perception aware features to train a classifier. CQCC is based on the constant Q transform, which is already perception aware. The constant Q cepstral coefficients (CQCCs) of a discrete-time signal with CQT $X(k)$ at a particular frame are extracted as

$$\mathrm{CQCC}(p) = \sum_{l=1}^{L} \log \left| X(l) \right|^{2} \cos\!\left[ \frac{p \left( l - \tfrac{1}{2} \right) \pi}{L} \right] \qquad (8)$$

where $p = 0, 1, \dots, L-1$ and $l$ indexes the newly resampled frequency bins [11].

For MFCC, the Mel-cepstrum applies a frequency scale based on auditory critical bands before the cepstral analysis [20]. The Mel-frequency cepstral coefficients (MFCCs) of a discrete-time signal with DFT $X(k)$ at a particular frame are extracted as

$$\mathrm{MFCC}(q) = \sum_{m=1}^{M} \log \left[ MF(m) \right] \cos\!\left[ \frac{q \left( m - \tfrac{1}{2} \right) \pi}{M} \right] \qquad (9)$$

where $MF(m)$ is the Mel-frequency spectrum, defined as

$$MF(m) = \sum_{k=1}^{K} \left| X(k) \right|^{2} H_m(k) \qquad (10)$$

where $k$ is the DFT index and $H_m(k)$ is the triangular weighting function of the $m$-th Mel-scaled bandpass filter.

Two Gaussian mixture models (GMMs) are trained on one kind of perception aware feature and used as a 2-class classifier in which the classes correspond to cold-affected and normal speech. The final score of a given test utterance is computed as the log-likelihood ratio between the two GMMs.
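A minimal sketch of this two-GMM back-end is given below, using scikit-learn's GaussianMixture as a stand-in for the GMM training actually used; the covariance type and the frame-averaged scoring are assumptions, while the 512-component setting follows the experiments in Section 3.

```python
# Minimal sketch: two-GMM back-end scored with a log-likelihood ratio.
# Feature extraction, covariance type and scoring details are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(cold_frames, healthy_frames, n_components=512):
    """cold_frames / healthy_frames: (n_frames, n_cepstra) arrays of CQCC or MFCC."""
    gmm_cold = GaussianMixture(n_components, covariance_type="diag").fit(cold_frames)
    gmm_healthy = GaussianMixture(n_components, covariance_type="diag").fit(healthy_frames)
    return gmm_cold, gmm_healthy

def llr_score(test_frames, gmm_cold, gmm_healthy):
    # score_samples returns per-frame log-likelihoods; average them and take
    # the difference to obtain the utterance-level log-likelihood ratio.
    return (np.mean(gmm_cold.score_samples(test_frames))
            - np.mean(gmm_healthy.score_samples(test_frames)))
```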

3. Experiments

3.1. Dataset

We use the Upper Respiratory Tract Infection Corpus (URTIC) provided by the Institute of Safety Technology, University of Wuppertal, Germany. The corpus consists of recordings with durations between 3 and 10 seconds; 9505 instances were selected for training, 9596 for the development set and 9551 for testing.

The URTIC corpus is imbalanced: the training set contains 970 cold-affected utterances but 8535 healthy ones [2]. A neural network trained on such an imbalanced dataset may not be discriminative enough between the classes [21]. To address this issue, we employ the simplest resampling technique, over-sampling the minority class by duplication, when training the end-to-end deep learning networks.

3.2. Experimental results

We first use CQCC features to model the cold-affected speech and the normal speech with two 512-component Gaussian mixture models, and compute the log-likelihood ratio between these two GMMs for each test utterance. We also use MFCC features with the same setup. The UARs with CQCC and MFCC features are 65.4% and 64.8% respectively, which is slightly better than the challenge organizers' SVM-based results.

We then apply the STFT spectrum, the CQT spectrum and the Gammatone spectrum to different end-to-end learning networks. First, the training data is cut into a series of 3-second segments with an overlap of 2 seconds. We then extract the three kinds of spectrograms (STFT, CQT and Gammatone); their numbers of time frames differ because of different frame-shift parameters. All of them are used as input to the neural networks. Table 1 gives the details of the network architectures.

Table 1: End-to-end learning network architectures. FC: fully connected layer. conv: convolutional layer.

CNN+GRU+FC:
  conv1-conv4: convolution kernels, stride 1
  pooling: 3x3 pooling, 2x1 stride
  GRU: 500 hidden units
  FC: classification layer

CNN+FC:
  conv and pooling: same as above
  FC1: 50 hidden units
  FC2: classification layer

During the neural network training phase, we use batch normalization to speed up training. As the data are fed forward through a deep network, the parameters of the current layer transform the input data and change its distribution for the next layer, which is referred to as internal covariate shift. Batch normalization addresses this problem by normalizing the layer inputs [22]. We also employ dropout to counter overfitting when training the neural networks with scarce labelled data [23].

Table 2 shows the experimental results of the baseline systems and our proposed systems.

Table 2: URTIC development set results for predicting cold-affected speech.

  Algorithm        ID  Input                  UAR
  SVM [2]           1  ComParE functionals    64.0%
                    2  ComParE BoAW           64.2%
  GMM               3  MFCC                   64.8%
                    4  CQCC                   65.4%
  CNN+LSTM [2]      5  Time representation    59.1%
  CNN+FC            6  STFT spectrum          64.1%
                    7  CQT spectrum           68.5%
                    8  Gammatone spectrum     65.6%
  CNN+GRU+FC        9  STFT spectrum          66.7%
                   10  CQT spectrum           68.4%
                   11  Gammatone spectrum     67.7%
  Fusion               baseline systems [2]   66.1%

It is observed that both the CQT spectrum and the Gammatone spectrum outperform the STFT spectrum in terms of UAR with the CNN+GRU+FC network setup. The best result of our end-to-end systems (the CQT spectrum with CNN+GRU+FC) is 15.7% better than the provided end-to-end baseline (raw waveforms with CNN+LSTM). We use the BOSARIS toolkit [24] to fuse the system scores. The fusion results show that the CQT and Gammatone spectra are complementary to each other, as are the different neural network architectures. The GMM systems with CQCC or MFCC also help to improve the overall performance. The final fusion result on the URTIC development set is 71.4% UAR, which is 8% better than that of the provided baseline. The final fusion result on the test set, 66.71% UAR, unfortunately suffers from overfitting. Fusing it with the ComParE functionals baseline (70.2%) [2] gives 71.2% UAR on the test set.
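As a rough stand-in for the BOSARIS-style fusion (we do not reproduce the toolkit itself), score-level fusion can be sketched as a linear logistic regression trained on the development-set scores; the regularization setting and the score layout below are assumptions.

```python
# Minimal sketch: score-level fusion via linear logistic regression
# (a stand-in for the BOSARIS toolkit; settings are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(dev_scores, dev_labels, eval_scores):
    """dev_scores / eval_scores: (n_utterances, n_systems) score matrices;
    dev_labels: 1 for cold-affected, 0 for healthy speech."""
    fuser = LogisticRegression().fit(dev_scores, dev_labels)
    # The fused score is a calibrated linear combination of the system scores.
    return fuser.decision_function(eval_scores)
```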
4. Conclusion

In this paper, we propose to use a perception aware spectrum in an end-to-end deep neural network to perform the computational paralinguistic task of detecting cold-affected speech. On small-scale datasets, perception aware spectra such as the CQT spectrum and the Gammatone spectrum outperform the raw time-domain representation and even the conventional STFT spectrum in end-to-end learning. We also investigate the performance of perception aware features such as CQCC and MFCC when fed into GMM classifiers, and verify the effectiveness of a deep learning network with a properly designed architecture and perception aware spectrum input. We have tried different spectrum inputs in different neural network architectures as well as the conventional classifiers; fusing the results of these systems brings a performance gain and shows that these features and methods are significantly complementary to each other.

The computational paralinguistic task of detecting cold-affected speech still leaves many problems to be investigated. For example, we have tried to use a phone decoder on the given dataset and to separately model three kinds of phone sets, consisting of vowels, nasals and other consonants, with the split utterances. The experimental results show little discrimination between the three phone models mentioned above. This may be due to the inaccurate phone decoder as well as the imbalanced training data for the phone set models. In future work, we will try a more accurate phone decoder and more appropriate modeling algorithms. Moreover, we will try to combine this idea with attention-based neural networks.

5. References

[1] B. Schuller, "The computational paralinguistics challenge [social sciences]," IEEE Signal Processing Magazine, vol. 29, 2012.
[2] B. Schuller, S. Steidl, A. Batliner, E. Bergelson, J. Krajewski, C. Janott, A. Amatuni, M. Casillas, A. Seidl, M. Soderstrom et al., "The INTERSPEECH 2017 computational paralinguistics challenge: Addressee, cold & snoring," in Proceedings of INTERSPEECH, 2017.
[3] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi et al., "The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism," in Proceedings of INTERSPEECH, 2013.
[4] M. Schmitt and B. W. Schuller, "openXBOW: Introducing the Passau open-source crossmodal bag-of-words toolkit," arXiv preprint.
[5] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in Proceedings of ICASSP, 2016.
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.
[7] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, "Automatic language identification using long short-term memory recurrent neural networks," in Proceedings of INTERSPEECH, 2014.
[8] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in Proceedings of ICASSP, 2016.
[9] Y. Shan and Q. Zhu, "Speaker identification under the changed sound environment," in Proceedings of ICALIP, 2014.
[10] B. C. Moore, An Introduction to the Psychology of Hearing. Brill.
[11] M. Todisco, H. Delgado, and N. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients," in Proceedings of the Speaker Odyssey Workshop, Bilbao, Spain, 2016.
[12] D. Ellis, "Gammatone-like spectrograms," web resource: columbia.edu/dpwe/resources/matlab/gammatonegram.
[13] Y. Shao, S. Srinivasan, and D. Wang, "Incorporating auditory feature uncertainties in robust speaker identification," in Proceedings of ICASSP, vol. 4, 2007.
[14] S. Sigtia, E. Benetos, and S. Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, 2016.
[15] T. Lidy and A. Schindler, "CQT-based convolutional neural networks for audio scene classification and domestic audio tagging," DCASE 2016 Challenge, 2016.
[16] J. Youngberg and S. Boll, "Constant-Q signal analysis and synthesis," in Proceedings of ICASSP, vol. 3, 1978.
[17] J. C. Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, 1991.
[18] M. Cooke, Modelling Auditory Processing and Organization, Distinguished Dissertations in Computer Science Series. Cambridge University Press.
[19] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proceedings of ICASSP, 2015.
[20] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, 1980.
[21] E. DeRouin, J. Brown, H. Beck, L. Fausett, and M. Schneider, "Neural network training on unequally represented classes," Intelligent Engineering Systems Through Artificial Neural Networks.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of ICML, 2015.
[23] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, 2014.
[24] N. Brümmer and E. de Villiers, "The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF," in NIST SRE Analysis Workshop.
