
MUSICAL GENRE CLASSIFICATION OF AUDIO DATA USING SOURCE SEPARATION TECHNIQUES

P.S. Lampropoulou, A.S. Lampropoulos and G.A. Tsihrintzis
Department of Informatics, University of Piraeus
80 Karaoli & Dimitriou St, Piraeus 185 34, Greece
{vlamp, arislamp, geoatsi}@unipi.gr

Abstract. We propose a two-step, audio feature-based musical genre classification methodology. First, we identify and separate the various musical instrument sources in the audio signal using the convolutive sparse coding algorithm. Next, we extract classification features from the separated signals that correspond to distinct musical instrument sources. The methodology is evaluated and its performance is assessed.

Key words: Music signal processing, source separation, music genre classification

1. INTRODUCTION AND WORK OVERVIEW

In recent years, there have been many works on audio content analysis that use different features and methods [2], [4], [5], [6], [11], [12] to extract information directly from actual music data through automated processes. These methodologies rely on objective, content-based meta-information and are to be contrasted with their counterparts in currently available music search engines and peer-to-peer systems (e.g. Kazaa, eMule, Torrent), in which the retrieval mechanism relies on subjective textual meta-information, such as file names and ID3 tags. Content-based methodologies have been developed as a possible solution to the need for systems that can efficiently manage and organize the large collections of stored music files resulting from progress in digital storage technology and the huge increase in the availability of digital music. Most of these techniques focus on automatic music genre classification and organize digital music into categorical labels created by human experts, using objective features of the audio signal related to instrumentation, timbral texture, and rhythmic and pitch content [4], [11]. These methods use pattern recognition techniques and offer the possibility of content-based indexing and retrieval. However, all of these works extract the feature vector from the complex sound structure of the entire audio signal in a music file.

In this paper, we propose a new approach to musical genre classification based on features extracted from signals that correspond to distinct musical instrument sources. Our approach differs from previous work in that we first detect the various musical instrument sources in a music clip by decomposing the audio signal into a number of component signals, each of which corresponds to a different musical instrument source, as in Fig. 1. Next, we extract timbral, rhythmic and pitch features from the separated instrument sources and use them to classify the music clip. This procedure is similar to that of a human listener, who is able to determine the genre of a music signal and, at the same time, distinguish a number of different musical instruments in a complex sound structure.

The problem of separating the component signals that correspond to the distinct musical instruments that generated an audio signal is ill-defined, as there is no prior knowledge about the various instrumental sources. Many techniques have been used successfully to solve the general blind source separation problem in several application areas, with the Independent Component Analysis (ICA) method [8], [10] appearing to be one of the most promising. ICA assumes that the individual source components in an unknown mixture have the property of mutual statistical independence, and this property is exploited in order to algorithmically identify the latent sources. However, ICA-based methods require certain limiting assumptions, such as the assumption that the number of source signals be at most as high as the number of observed mixture signals and that the mixing matrix be of full rank. A related method has been proposed that relaxes the constraint on the number of observed mixture signals: the Independent Subspace Analysis (ISA) method, which can separate individual sources from a single-channel mixture by using sound spectra [7]. Signal independence is the main assumption of both the ICA and ISA methods. In musical signals, however, there exist dependencies in both the time and frequency domains. To overcome these limitations, we use in our system a recently proposed data-adaptive technique, similar to ICA, called Convolutive Sparse Coding (CSC) [9]. This method is presented in detail in Section 2.1.

The paper is organized as follows: The overall architecture of our proposed system is presented in Section 2, in which we also describe in detail the CSC source separation method and the extraction of music (audio) content-based features. The classification method and results are given in Section 3, and conclusions and suggestions for future work are given in Section 4.

2. PROPOSED SYSTEM ARCHITECTURE

The architecture of our proposed system consists of three main modules. The first module realizes the separation of the component signals in the input signal, while the second module extracts features from each signal produced during source separation. Finally, the last module is a supervised classifier of genre and musical instrument. Each music piece can be stored in any audio file format, such as .mp3, .au, or .wav, which requires format normalization before feature extraction. Specifically, we decode each music file to raw Pulse Code Modulation (PCM) using the LAME decoder [14], converting it to the .wav format with a resolution of 16-bit samples at a sampling rate of 22,050 Hz.

2.1. Source Separation using Convolutive Sparse Coding

For source separation, we choose the method of convolutive sparse coding because it avoids, at least partially, the assumptions of fixed spectra over time and of the reconstruction error as the model-fitting criterion, which are not valid for audio signals. Moreover, this technique uses compression and enables higher perceptual quality of the separated sources. The basic signal model in general sparse coding is that each observation vector $x_i$ is a linear mixture of source vectors $s_j$:

$$x_i = \sum_{j=1}^{J} a_{i,j}\, s_j, \qquad i = 1, \dots, I, \tag{1}$$

where $a_{i,j}$ is the weight of the $j$-th source in the $i$-th observation signal.
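For concreteness, the following is a minimal sketch (not the authors' implementation) of the format-normalization step described above and of the mixing model in Eq. (1). It assumes Python with the librosa and soundfile packages in place of the LAME decoder, and the input file names are hypothetical placeholders.

```python
# Minimal sketch: decode/resample to mono PCM at 22,050 Hz and build a
# linear mixture as in Eq. (1). Assumes librosa/soundfile; the single-instrument
# file names are hypothetical placeholders.
import numpy as np
import librosa
import soundfile as sf

SR = 22050

def normalize(path, sr=SR):
    """Decode any supported audio file to mono floating-point PCM at `sr` Hz."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    return y

# Two hypothetical single-instrument recordings acting as the sources s_j.
sources = [normalize("guitar.wav"), normalize("drums.wav")]
T = min(len(s) for s in sources)
S = np.stack([s[:T] for s in sources])          # J x T matrix of sources

A = np.array([[0.8, 0.6]])                      # I x J mixing weights a_{i,j} (here I = 1)
X = A @ S                                       # each observation x_i = sum_j a_{i,j} s_j

sf.write("mixture.wav", X[0], SR, subtype="PCM_16")   # store as 16-bit .wav
```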

Fig. 1. Overview of the proposed system: the initial signal is decomposed into source signals (Source 1, Source 2, ...), a feature vector is extracted from each source signal, and the feature vectors are passed to the classifier.

Both the source vectors and the weights are assumed unknown. The sources are obtained by multiplying the observation matrix by an estimate of the un-mixing matrix. The main assumption in sparse coding techniques is that the sources are inactive most of the time, which means that the mixing matrix has to be sparse. The estimation can be carried out using a cost function that minimizes the reconstruction error and maximizes the sparseness of the mixing matrix. More specifically, the method is called convolutive sparse coding because the source model is formulated as the convolution of a source spectrogram and an onset vector, which makes the model particularly suitable for repetitive transient sources.

The input signal is represented using the magnitude spectrogram, which is calculated as follows. First, the time-domain input signal is divided into frames and windowed with a fixed 40 ms Hamming window with 50% overlap between frames. Second, each frame is transformed into the frequency domain by computing its discrete Fourier transform (DFT) of length equal to the window size. Only positive frequencies are retained, and phases are discarded by keeping only the magnitude of the DFT spectra. This results in a spectrogram $x_{f,t}$, where $f$ is a discrete frequency index and $t$ is a frame index. A two-dimensional magnitude spectrogram is used to characterize one event of a source at discrete frequency $f$, using $t$ frames as the frame onset varies between 0 and $D$.
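As an illustrative sketch of this front end (again not the authors' code, and assuming Python with SciPy and librosa plus a hypothetical input file), the magnitude spectrogram can be computed as follows.

```python
# Sketch of the magnitude-spectrogram computation described above:
# 40 ms Hamming windows, 50% overlap, DFT magnitudes only (phases discarded).
import numpy as np
import librosa
from scipy.signal import stft

y, sr = librosa.load("mixture.wav", sr=22050, mono=True)   # hypothetical input

nperseg = int(0.040 * sr)                                  # 40 ms window (882 samples)
_, _, Z = stft(y, fs=sr, window="hamming",
               nperseg=nperseg, noverlap=nperseg // 2)     # 50% overlap, one-sided DFT
X = np.abs(Z)                                              # spectrogram x_{f,t}
print(X.shape)                                             # (frequency bins, frames)
```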

2.1.1. The iterative algorithm. The magnitudes $x_{f,t}$ and the weights $w_{f,t}$ are calculated. The number of sources $N$ is set by hand and should be equal to the number of clearly distinguishable instruments. If the spectrum of one source varies significantly, for example because of accentuation, one may have to use more than one component per source. The model considers the different fundamental frequencies of each instrument as separate sources. The vectors $a_1, \dots, a_N$ and $s$ are initialized with the absolute values of Gaussian noise, and the iteration then proceeds as follows:

1. Update $s_{f,t}$ using the multiplicative step
$$s^{(p+1)} = s^{(p)} \;.\!*\; \left( A^T W_f^T W_f x_f \right) \;./\; \left( A^T W_f^T W_f A\, s^{(p)} \right),$$
where $s^{(p+1)}$ is the update of $s^{(p)}$ at the $p$-th iteration, given $A$ and $W_f$.
2. Calculate the gradient of the total cost $c_{tot}$ (including the sparseness term weighted by $\lambda$) with respect to $a_n$.
3. Update $a_n \leftarrow a_n - \mu_\kappa \, \partial c_{tot} / \partial a_n$ and set the negative elements of $a_n$ to zero; $\mu_\kappa$ is the step size, which is set adaptively.
4. Evaluate the cost function.
5. Repeat Steps 1-4 until the value of the cost function remains unchanged.

In the synthesis stage, the convolutions are evaluated to obtain frame-wise magnitudes for each source. To obtain the complex spectrum, phases are taken from the phase spectrogram of the original mixture signal. The time-domain signal is then obtained by the inverse discrete Fourier transform and overlap-add. This procedure was found to produce the best quality, since the use of the original phases allows synthesis without abrupt changes in phase.

2.2. Feature Extraction

We transform an audio signal at a certain level of information granularity. Information granules refer to a collection of data that contain only essential information. Such granulation allows more efficient processing for extracting features and computing numerical representations that characterize a music signal. As a result, the large amount of detailed information in the signal is reduced to a small collection of features, each of which captures some aspect of the signal and gives essential information about it. In our system, we used the 30-dimensional objective feature vector originally proposed by Tzanetakis et al. [4] and used in other works [1], [2], [3], [6], [12]. For the extraction of the feature vector, we used MARSYAS 0.1, a public software framework for computer audition applications [5]. The feature vector consists of three different types of features, namely rhythm-related (beat), timbral texture (musical surface: STFT, MFCCs) and pitch content-related features [5].

2.2.1. Rhythmic Features. Rhythmic features characterize the movement of music signals over time and contain such information as the regularity of the tempo. The feature set for representing rhythm is based on detecting the most salient periodicities of the signal. Rhythm is extracted from the beat histogram, a curve describing beat strength as a function of tempo, and is used to obtain information about the complexity of the beat in the music piece. The regularity of the rhythm, the relation of the main beat to the sub-beats, and the relative strength of the sub-beats with respect to the main beat are used as features. The Discrete Wavelet Transform (DWT) is used to divide the signal into octave bands and, for each band, full-wave rectification, low-pass filtering, downsampling and mean removal are performed in order to extract an envelope. The envelopes of all bands are summed and the autocorrelation is calculated to capture the periodicities in the signal envelope. The dominant peaks in the autocorrelation function are accumulated over the entire audio signal into a beat histogram.

2.2.2. Timbral Texture Features. In short-time audio analysis, the signal is broken into small, possibly overlapping temporal segments, and each segment is processed separately. These segments are called analysis windows and need to be short enough for the frequency characteristics of the magnitude spectrum to remain relatively stable. The term texture window describes the longest window that is necessary to identify the music texture. The timbral features are based on the Short-Time Fourier Transform and are calculated for every analysis window; their means and standard deviations are then calculated over the texture window.
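The following is a highly simplified sketch of the beat-histogram idea of Section 2.2.1 (not the MARSYAS implementation): instead of wavelet-band envelopes, it autocorrelates librosa's onset-strength envelope to expose the dominant tempo periodicity; the input file is a hypothetical placeholder.

```python
# Simplified beat-periodicity sketch: envelope extraction, mean removal,
# autocorrelation, and the dominant periodicity expressed in BPM.
import numpy as np
import librosa

y, sr = librosa.load("mixture.wav", sr=22050, mono=True)   # hypothetical input
hop = 512                                                   # envelope frame hop (samples)

env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)  # amplitude/onset envelope
env = env - env.mean()                                           # mean removal
ac = librosa.autocorrelate(env)                                  # envelope periodicities

# Restrict to musically plausible lags (about 40-200 BPM) and pick the peak.
lags = np.arange(1, len(ac))
bpm = 60.0 * sr / (lags * hop)
mask = (bpm >= 40) & (bpm <= 200)
best = lags[mask][np.argmax(ac[1:][mask])]
print(f"dominant periodicity: {60.0 * sr / (best * hop):.1f} BPM")
```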

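Similarly, a hedged sketch of the timbral-texture statistics of Section 2.2.2, using librosa in place of MARSYAS: STFT-based descriptors are computed per analysis window and then summarized by their means and standard deviations over a texture window.

```python
# Sketch of timbral-texture features: per-window spectral descriptors and MFCCs,
# summarized by mean and standard deviation over the whole excerpt (one long
# "texture window"). The input file is a hypothetical placeholder.
import numpy as np
import librosa

y, sr = librosa.load("mixture.wav", sr=22050, mono=True)

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # 1 x frames
rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr)    # 1 x frames
mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=5)      # 5 x frames

frames = np.vstack([centroid, rolloff, mfcc])              # descriptors x analysis windows
texture_vector = np.concatenate([frames.mean(axis=1),      # means over the texture window
                                 frames.std(axis=1)])      # standard deviations
print(texture_vector.shape)                                # (14,)
```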
2.2.3. Pitch Features. Pitch features describe the melody and harmony information of a music signal. The pitch detection algorithm decomposes the signal into two frequency bands, and amplitude envelopes are extracted for each band. The envelope extraction is performed by applying half-wave rectification and low-pass filtering. The envelopes are summed, and an enhanced autocorrelation function is computed so that the effect of integer multiples of the peak frequencies on multiple pitch detection is reduced. The dominant peaks of the autocorrelation function are accumulated into pitch histograms, and the pitch content features are extracted from the pitch histograms. These features typically include the amplitudes and periods of the maximum peaks in the histogram, the pitch intervals between the two most prominent peaks, and the overall sums of the histograms.

3. CLASSIFICATION METHOD AND RESULTS

We tried and evaluated different classifiers contained in the machine learning tool WEKA [13], which we connected to our system. One of these classifiers is a multilayer perceptron. The network input is the feature vector corresponding to the component signals produced by source separation. The network consists of two hidden layers of neurons, and the number of neurons in the output layer is determined by the number of audio classes into which we want to classify (six in this work). The network was trained with the back-propagation algorithm, and its output estimates the degree of membership of the input feature vector in each of the six audio classes; thus, the value at each output necessarily remains between 0 and 1.

Classification results were calculated using 10-fold cross-validation, in which the dataset was randomly partitioned so that 10% was used for testing and 90% for training. This process was iterated with different random partitions and the results were averaged, ensuring that the calculated accuracy was not biased by any particular partitioning into training and testing sets.

Table 1 - Correctly Classified Instances without/with Source Separation

Classifier                                  % w/out SS    % with SS
Nearest-Neighbour Classifier                67.5882       68.1655
MultilayerPerceptron, 4 hidden layers       73.2126       74.8946
MultilayerPerceptron, 10 hidden layers      73.9752       75.0164
AdaBoostM1                                  75.8818       77.4978

As seen in Table 1, the results after application of the source separation technique improved by 1-2%. This is because the source separation technique reveals more information about the timbral texture, rhythm and pitch (harmony) content, not only for the signal as a whole but also for each of the separated instrument sources.
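To make the evaluation protocol of this section concrete, below is a hedged sketch (with synthetic stand-in data, and scikit-learn in place of the WEKA classifiers used in the paper) of training a multilayer perceptron on 30-dimensional feature vectors and scoring it with 10-fold cross-validation.

```python
# Sketch of the classification/evaluation step: an MLP over 30-dimensional
# feature vectors for six genres, scored by 10-fold cross-validation.
# The feature matrix here is random stand-in data, not the paper's dataset.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 30))          # 600 clips x 30 features (placeholder)
y = rng.integers(0, 6, size=600)        # six genre labels (placeholder)

clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)        # 10% test / 90% train per fold
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```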

4. CONCLUSIONS AND FUTURE WORK

We presented a new approach to automatic musical genre classification, inspired by the observation that audio signals corresponding to music of the same genre share certain common characteristics, as they are performed by similar types of instruments and have similar pitch distributions and rhythmic patterns. Our approach is based on classifying features extracted from signals that correspond to distinct musical instrument sources, as identified by a source separation process. Evaluation of the performance of our proposed approach showed improved correct classification results over existing methods. In the future, we will extend our musical genre classification method further and combine it with other audio signal representation tools, such as discrete wavelet transforms. This and related work is currently in progress and its results will be announced shortly.

REFERENCES

[1] A.S. Lampropoulos, D.N. Sotiropoulos and G.A. Tsihrintzis, "Individualization of Music Similarity Perception via Feature Subset Selection", IEEE International Conference on Systems, Man & Cybernetics, The Hague, The Netherlands, October 10-13, 2004.
[2] A.S. Lampropoulos and G.A. Tsihrintzis, "Agglomerative Hierarchical Clustering for Musical Database Visualization and Browsing", Proceedings of the 3rd Hellenic Conference on Artificial Intelligence, Samos, Greece, 2004.
[3] A.S. Lampropoulos and G.A. Tsihrintzis, "Semantically Meaningful Music Retrieval with Content-Based Features and Fuzzy Clustering", 5th International Workshop on Image Analysis for Multimedia Interactive Services, Lisbon, Portugal, 2004.
[4] G. Tzanetakis and P. Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 10(5), July 2002.
[5] G. Tzanetakis and P. Cook, "MARSYAS: A Framework for Audio Analysis", Organised Sound, 4(3), 2000.
[6] K. Kosina, Music Genre Recognition, PhD thesis, Hagenberg, 2002.
[7] M. Casey and A. Westner, "Separation of Mixed Audio Sources by Independent Subspace Analysis", Proceedings of the International Computer Music Conference, ICMA, Berlin, August 2000.
[8] M.D. Plumbley, S.A. Abdallah, J.P. Bello, M.E. Davies, G. Monti and M.B. Sandler, "Automatic Music Transcription and Audio Source Separation", Cybernetics and Systems, 33(6), pp. 603-627, 2002.
[9] T. Virtanen, "Separation of Sound Sources by Convolutive Sparse Coding", ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA), 2004.
[10] K. Martin, Sound-Source Recognition: A Theory and Computational Model, PhD thesis, MIT, 1999.
[11] E. Wold, T. Blum and J. Wheaton, "Content-based Classification, Search and Retrieval of Audio", IEEE Multimedia, 3(3), pp. 27-36, 1996.
[12] C.H.L. Costa, J.D. Valle Jr. and A.L. Koerich, "Automatic Classification of Audio Data", IEEE International Conference on Systems, Man & Cybernetics, The Hague, The Netherlands, October 10-13, 2004.
[13] WEKA: http://www.cs.waikato.ac.nz/ml/weka
[14] LAME: http://lame.sourceforge.net