Report for EE 391 Special Studies and Reports for Electrical Engineering

Drum Transcription Based on Independent Subspace Analysis

Yinyi Guo
Center for Computer Research in Music and Acoustics, Stanford, CA
hugo@ccrma.stanford.edu

ABSTRACT

In automatic music transcription, metadata extraction from recorded audio, and speaker separation in video conferencing, analyzing and separating an audio signal into its original source components is an important prerequisite task. In this report, I study and analyze a set of methods for extracting percussive instrument metadata from polyphonic music. The report focuses mainly on the audio source separation stage, which draws on Principal Components Analysis, Independent Components Analysis, and Non-negative Matrix Factorization. Different music samples have been analyzed with this spectrogram decomposition method. The results are very encouraging when the goal is the extraction of non-pitch information rather than perfect note-to-note transcription.

1. MOTIVATION

Rhythm is an essential concept in musical structure, and because percussive instruments contribute strongly to the rhythmical impression, drum scores are a prerequisite for any further high-level description of rhythmical content. A drum transcription is described by symbolic metadata, including onset times and drum types. This information about rhythmical patterns enables further categorization of musical content, such as genre classification and music mood analysis. The measurement of less subjective musical elements such as tempo and meter also benefits significantly from the availability of a drum score. In addition, drum replacement in audio and automatic generation of drum scores from recorded music have become increasingly popular in the music entertainment industry, for example in video games and iPhone applications. Thus, automated transcription of the drum score can contribute greatly to today's music retrieval algorithms, and it also stimulates the development of a variety of applications in the audio industry.
2. SYSTEM OVERVIEW

[Figure 1: Drum Transcription System Overview. Blocks shown: PCM Audio Signals; Time Frequency Transformation; Peaks Picking & Onsets Detection; Principal Components Analysis (PCA); Non-negative Independent Components Analysis (ICA); Non-negative Matrix Factorization (NMF); Features Extraction; Sources Classification & Onsets Acceptance; Training Sources Spectral Profiles; Symbolic Data Transcription & MIDI Synthesis; Drum Scores]

An overview of the drum transcription system is presented in Figure 1. The digital audio signals used in the subsequent signal processing chain are mono files with 16 bits per sample at a sampling frequency of 44.1 kHz. A spectral representation of the pre-processed time signal is computed using a Short Time Fourier Transform (STFT). The magnitude spectrogram is then differentiated and half-wave rectified, which yields a non-negative difference spectrogram for further processing. Next, multiple local maxima associated with transient onset events in the musical signal are detected with a basic peak picking method. The main idea of the subsequent processing is to store a short excerpt of the difference spectrogram at each onset time t. From these onset frames, the significant spectral profiles are gathered in the next stages: PCA for dimensionality reduction and ICA for audio source separation. The subsequent sections give a more in-depth account of the source separation stage within the whole signal processing chain.

3. AUDIO SOURCE SEPARATION

3.1 Principal Components Analysis (PCA)

Principal Components Analysis is a way of identifying patterns in data and expressing the data so as to highlight their similarities and differences. Since patterns can be difficult to find in high-dimensional data, such as graphical, audio, or video representations, PCA is a powerful analysis tool. Another advantage is that PCA can reduce the dimensionality of the original data without losing much information.

The preceding steps provide the time of occurrence t and the spectral composition of each onset. To find a limited number of significant pattern subspaces in the high-dimensional onset spectra, PCA is applied to reduce the dimensionality of the percussive sources. With PCA, the whole set of collected spectra can be broken down into a relatively small number of decorrelated principal components, giving a good representation of the original spectra with a small reconstruction error.

First, we calculate the covariance matrix of the original spectra vectors after subtracting their means. Next, we compute the eigenvectors of the covariance matrix, which are orthogonal to each other, and their corresponding eigenvalues. From the set of eigenvectors, those associated with the d largest eigenvalues are chosen, since the eigenvectors with the largest eigenvalues are the principal components of the data set. They provide the coefficients for a linear combination of the original spectra vectors according to the equation $\tilde{X} = X_t \cdot T$, where $X_t$ denotes the onset spectra and T is the transformation matrix for the dimensionality reduction. The components in $\tilde{X}$ are decorrelated from each other and variance normalized. They can subsequently be passed to the ICA stage described in the next section.
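As a minimal sketch of the PCA step described above, the following NumPy function computes the reduced, variance-normalized components. The function name `pca_whiten`, the array layout (one onset spectrum per column), and the scaling of the eigenvectors by the inverse square root of their eigenvalues are my assumptions for illustration; the report does not fix these details.

```python
import numpy as np

def pca_whiten(X_t, d):
    """Reduce onset spectra to d decorrelated, unit-variance components.

    X_t : array of shape (n_bins, n_onsets), one onset spectrum per column
          (assumed layout, not specified in the report).
    d   : number of principal components to keep.
    Returns (X_tilde, T) with X_tilde = X_centered @ T.
    """
    # Subtract the mean of each spectrum vector (column).
    X_centered = X_t - X_t.mean(axis=0, keepdims=True)

    # Covariance between the onset spectra: shape (n_onsets, n_onsets).
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Keep the eigenvectors belonging to the d largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:d]
    top_vals = eigvals[order]
    top_vecs = eigvecs[:, order]

    # Scale each eigenvector so the projected components have unit variance.
    # This assumes the d largest eigenvalues are strictly positive.
    T = top_vecs / np.sqrt(top_vals)

    # Linear combination of the original spectra vectors.
    X_tilde = X_centered @ T
    return X_tilde, T
```

With this layout, each column of `X_tilde` is one reduced component of length `n_bins`, and the sample covariance of the columns is the d x d identity matrix, which is the decorrelated and variance-normalized property stated in the text.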
3.2 Non-negative Independent Components Analysis (ICA)

Non-negative Independent Components Analysis is one variant of Independent Component Analysis for separating a set of linearly mixed signals into their original sources. A requirement for optimum performance of the algorithm is the statistical independence of the sources; the PCA stage supports this by decorrelating the components beforehand, although decorrelation alone does not guarantee independence. Non-negative ICA uses the very intuitive concept of optimizing a cost function that describes the non-negativity of the components. This cost function is related to the reconstruction error introduced by axis pair rotations of two or more variables in the positive quadrant of the joint probability density function [2].

This ICA algorithm rests on two assumptions: the original source signals are non-negative, and they are to some extent linearly independent. The first assumption is always fulfilled because the vectors processed by ICA come from the amplitude spectrogram X, which has been differentiated and half-wave rectified at an earlier stage and therefore contains no negative values. For the second assumption, the spectra collected at onset times can be regarded as the superposition of a small set of original source spectra representing the percussive instruments involved. It may safely be assumed that spectral profiles of drum sounds have some inherent characteristic properties [4][1] that allow us to separate the whitened components into their potential percussive sources F according to the equation $F = A \cdot \tilde{X}$, where A denotes the d x d unmixing matrix, estimated iteratively by the ICA optimization process [2], which actually separates the individual components of $\tilde{X}$. The source vectors F are called spectral profiles [4]. The original spectral profiles of the percussive instruments used as training reference are shown in Figure 2, and the spectral profiles for one particular input sample are shown in Figure 3.

[Figure 2: Spectral profiles reference]

[Figure 3: Spectral profiles of input sample]

3.3 Extraction of Amplitude Bases

Once a certain number of spectral profiles have been computed, they can be used to extract the amplitude bases of the spectrogram, from here on referred to as amplitude envelopes, according to the equation $E = F \cdot X$. No further ICA computation is applied in extracting the amplitude envelopes. Nevertheless, the extracted amplitude envelopes E offer decent detection functions, with peaks and plateaus, for the subsequent detection and classification stage.
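The envelope extraction in this section is a single matrix product, which the following sketch illustrates on a toy example. The function name `amplitude_envelopes`, the array shapes, and all numbers are made up for illustration and are not taken from the experiments in this report.

```python
import numpy as np

def amplitude_envelopes(F, X):
    """Project a spectrogram onto spectral profiles: E = F . X.

    F : (d, n_bins) array, one spectral profile per row (assumed layout).
    X : (n_bins, n_frames) non-negative spectrogram.
    Returns E of shape (d, n_frames), one amplitude envelope per profile.
    """
    return F @ X

# Toy data: two hypothetical, non-overlapping profiles with unit norm.
F = np.array([
    [1.0, 1.0, 0.0, 0.0],   # energy in the low bins (e.g. a bass drum)
    [0.0, 0.0, 1.0, 1.0],   # energy in the high bins (e.g. a hi-hat)
]) / np.sqrt(2.0)

# Made-up gains per frame: which source is active, and how strongly.
G = np.array([
    [1.0, 0.0, 0.0, 0.8, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
])

# Synthetic spectrogram as a weighted sum of the two profiles.
X = F.T @ G

E = amplitude_envelopes(F, X)
```

Because the toy profiles do not overlap and have unit norm, `E` reproduces the gains `G` exactly. Profiles estimated from real recordings generally overlap in frequency, so the envelopes then contain some crosstalk between sources, which is one reason a separate onset acceptance stage is still needed.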
[Figure 4: Extracted Amplitude Envelopes]

3.4 Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) [3] is another decomposition approach, alongside Independent Component Analysis, that has been used successfully in several unsupervised learning tasks and in the analysis of music signals. It is an alternative method for computing the sources' spectral profiles and amplitude envelopes. For music signals, NMF has been used to separate the input signal into a sum of sources, each of which has a fixed spectrum and a time-varying gain. This model is well suited to representing drum signals. The signal model for the spectrum $X_t(f)$ in frame t can be written as a weighted sum of source spectra $S_n(f)$:

$X_t(f) = \sum_{n=1}^{N} a_{n,t} S_n(f)$,

where the $S_n(f)$ correspond to the spectral profiles of the involved drum sounds obtained from the ICA computation, and the gains $a_{n,t}$ correspond to the extracted amplitude envelopes discussed in the last section, serving as detection functions for percussive events with peaks and plateaus.

In NMF, both the spectra $S_n(f)$ and the gains $a_{n,t}$ are restricted to be non-negative. For audio source separation, this means the spectrograms are purely additive. It has turned out that the non-negativity constraint alone is sufficient for separating sources [5]. NMF is similar to ICA in that it minimizes a cost function, here a divergence between the observed spectrum and the model, until convergence. The divergence is minimized by an iterative algorithm that uses multiplicative updates [3]. The main difference between ICA and NMF is that in NMF both the spectral profiles and the gain envelopes are obtained by the iterative estimation algorithm.

3.5 Components Classification and Onsets Acceptance

The assignment of spectral profiles to the pre-trained profiles of drum instruments is performed by a simple k-nearest neighbor classifier, with spectral profiles of single percussive instruments as the training database.
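A k-nearest neighbor assignment of this kind can be sketched as follows. The function name `classify_profile`, the use of one minus the Pearson correlation coefficient as the distance, and the majority vote are assumptions for illustration; the exact form used in the system may differ.

```python
from collections import Counter

import numpy as np

def classify_profile(profile, ref_profiles, ref_labels, k=1):
    """Assign a drum label to a spectral profile by k-nearest neighbors.

    profile      : 1-D array, the incoming spectral profile.
    ref_profiles : sequence of 1-D arrays, the trained reference profiles.
    ref_labels   : list of labels, one per reference profile.
    k            : number of neighbors that vote.
    """
    # Distance: 1 - correlation coefficient (assumed form), so that
    # highly correlated profiles count as close.
    dists = np.array([
        1.0 - np.corrcoef(profile, ref)[0, 1] for ref in ref_profiles
    ])

    # Indices of the k closest references, nearest first.
    nearest = np.argsort(dists)[:k]
    votes = [ref_labels[i] for i in nearest]

    # Majority vote; on a tie, the label seen first (the nearer one) wins.
    return Counter(votes).most_common(1)[0][0]
```

A profile with most of its energy in the low bins would be assigned to whichever reference it correlates with most strongly, for example a bass drum reference with a similar low-frequency shape. Note that `np.corrcoef` is undefined for constant vectors, so completely flat profiles would need to be filtered out beforehand.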
The distance function is calculated from the correlation coefficient between the reference profile and the incoming profile. Drum-like onset events are detected in the amplitude envelopes using a traditional peak picking approach. The magnitude of the amplitude envelope at each onset candidate's position is assigned to that candidate. If this value exceeds a predetermined adaptive threshold, the onset is accepted. The threshold varies over time according to the amount of energy in a relatively large range surrounding the onset.

4. RESULTS

To quantify the abilities of the algorithm, the ground truth drum scores of 20 excerpts were extracted from the corresponding MIDI files as a reference. Each excerpt is 40 seconds long, with a 44.1 kHz sampling rate and 16-bit quantization. The examples cover different musical genres, including rock, pop, Latin, and rap. They were chosen for their distinct musical characteristics and with the intention of confronting the system with a significant variety of possible percussive instruments and sounds. In this research, we consider only a limited number of percussive instruments: bass drum, snare drum, hi-hat, cymbal, and tom. The results shown in Table 1 are based on the ICA algorithm (the NMF implementation is still in progress) and are evaluated with the standard measures of precision, recall, and F-score.

Class         Precision Rate   Recall Rate   F-Score
Bass Drum     90%              93%           91%
Snare Drum    82%              88%           85%
Hi-Hat        74%              85%           79%
Cymbal        68%              73%           70%
Tom           73%              84%           78%

Table 1 Drum Transcription Results

From the results above, we find that the recall rate is consistently higher than the precision rate, which means the detection system is oversensitive and accepts additional non-drum events. In addition, the results for high-frequency instruments such as hi-hat and cymbal are less accurate than those for lower-sounding drums. This is because very prominent and dynamic sustained harmonic instruments, such as an expressive singing voice or guitar solos, tend to affect the purity of the separated sources and thereby increase the number of detected onsets.

5. CONCLUSIONS

This report presented a source separation method based on independent subspace analysis for the automatic detection and classification of percussive instruments in recorded audio signals. The results are promising when the goal is the extraction of non-pitch information rather than perfect note-to-note transcription. Further improvements will target the classification and onset acceptance stages by seeking more adaptive methods. In addition, the training data for the spectral profiles needs to be improved by using larger and more standard datasets.

6. REFERENCES

[1] C. Uhle, C. Dittmar, and T. Sporer, "Extraction of Drum Tracks from Polyphonic Music Using Independent Subspace Analysis," in Proc. of the Fourth International Symposium on Independent Component Analysis, Nara, Japan, 2003.
[2] M. Plumbley, "Algorithms for Non-Negative Independent Component Analysis," IEEE Transactions on Neural Networks, 14(3), pp. 534-543, May 2003.
[3] D. D. Lee and H. S. Seung, "Algorithms for Non-negative Matrix Factorization," in T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pp. 556-562. MIT Press, 2001.
[4] C. Dittmar and C. Uhle, "Further Steps towards Drum Transcription of Polyphonic Music," in Proc. of the AES 116th Convention, Berlin, 2004.
[5] J. Paulus and T. Virtanen, "Drum Transcription with Non-negative Spectrogram Factorisation," submitted to EUSIPCO 2005, Antalya, Turkey, Sept. 4-8, 2005.