ME 535 Course Project: Analysis of Audio Using PCA


SAMEER MESHRAM (M.S.), VARUN RUPCHANDANI (M.S.), VIDUR VIJ (M.S.)
June 4, 2018

EXECUTIVE SUMMARY: In this project we separate a mixed audio signal by implementing Variational Mode Decomposition (VMD) and Principal Component Analysis (PCA). A mixture of pure sinusoids of different frequencies is separated very efficiently, while a mixture of instruments (cymbals and bass drum) is separated satisfactorily. The original and unmixed (separated) signals are compared through metrics such as the correlation coefficient, percentage error and root mean square error, and visually by means of spectrograms. A brief overview of audio acquisition, representation and storage is also provided.

1. INTRODUCTION

The aim of this report is to give the reader a basic understanding of an algorithm for single channel blind source separation (SCBSS) based on variational mode decomposition (VMD) and principal component analysis (PCA). A brief description of audio acquisition, representation and storage is included in this section. Section 2 describes the proposed method, section 3 presents the results, and section 4 discusses the work and gives concluding remarks. The program implementing the algorithm can be found in the Appendix.

1A] AUDIO ACQUISITION, REPRESENTATION and STORAGE

Audio acquisition is the process of converting the physical phenomenon we call sound into a format suitable for digital processing; representation is the problem of extracting from the sound the information necessary to perform a specific task; and storage is the task of reducing the number of bits necessary to encode the acoustic signals [1]. A typical sound wave can be represented by:

$$s(t) = A \sin(2\pi f t + \varphi) = A \sin\left(\frac{2\pi}{T} t + \varphi\right)$$

where $A$ is the amplitude, representing the maximum distance from the equilibrium position, $\varphi$ is the phase, and $T$ is the period, i.e. the length of the time interval between two consecutive repetitions of $s(t)$. The frequency $f = 1/T$ is measured in Hz.

The only sounds used in audio applications are those that can be perceived by human beings. Their intensities can be measured through the ratio $I/I_0$ of the intensity $I$ to the threshold of hearing (THO) $I_0$, i.e., the minimum intensity that can be detected by human ears. However, this is problematic: $I_0$ corresponds to $10^{-12}$ W/m$^2$, while the maximum intensity that can be tolerated without permanent physiological damage is $I_{max} = 10^{3}$ W/m$^2$. The ratio $I/I_0$ can thus range across 15 orders of magnitude, which makes it difficult to manage different intensity values. Hence, the ratio is measured on the decibel (dB) scale:

$$I_{dB} = 10 \log_{10}\left(\frac{I}{I_0}\right)$$

where $I_{dB}$ is the intensity measured in dB.

a] Sampling

In sampling, the displacement of the measurement membrane is measured at regular time steps. The number $F$ of measurements per second is called the sampling frequency or sampling rate, and the time interval $T_c = 1/F$ between two consecutive measurements is called the sampling period. The relationship between the analog signal $s(t)$ and the sampled signal $s[n]$ is:

$$s[n] = s(n T_c)$$

The sampling frequency is chosen according to the Nyquist (or Shannon) criterion, $F > 2 f_{max}$, where $f_{max}$ is the highest frequency represented in the sound signal. The following table shows common sampling rates along with bandwidth:

Sound Quality                 Required Bandwidth    Sampling Rate
Telephone                     200 Hz to 3.4 kHz     8000 Hz
CD Audio                      5 Hz to 20 kHz        44100 Hz
DAT Audio                     5 Hz to 20 kHz        48000 Hz
DVD Audio                     0 Hz to 96 kHz        192000 Hz
DVD Audio with stereomixes    0 Hz to 96 kHz        192000 Hz

Table 1: Common sampling rates and bandwidths

b] Audio Representation

Since sound is a longitudinal pressure wave, it takes continuous values rather than discrete ones. Because most signal processing is handled by digital devices, sound has to be digitized. Digitization means conversion to a stream of numbers, preferably integers for increased processing efficiency [2]. Sampling in the amplitude or voltage dimension is called quantization. The most straightforward method to perform this task is known as pulse code modulation (PCM). PCM involves three steps: sampling, quantization (linear or non-linear) and coding (A-law or μ-law). The figure below can be used to conceptualize sampling and quantization:

Figure 1: Left: Sampling the analog signal in time; Right: Quantization is sampling the analog signal in amplitude dimension [2]
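The sampling relation $s[n] = s(nT_c)$, the Nyquist condition and linear quantization described above can be sketched in a few lines of MATLAB. This is only an illustrative sketch: the tone frequency, the 8000 Hz rate, the 8-bit depth and all variable names are assumptions chosen for the example and are not taken from the project.

```matlab
% Sketch: sample a sinusoid s(t) = A*sin(2*pi*f*t + phi) and quantize it.
% All numeric values below are illustrative assumptions.
A   = 1.0;       % amplitude
f   = 440;       % signal frequency in Hz
phi = 0;         % phase in radians
F   = 8000;      % sampling frequency in Hz (telephone quality)
Tc  = 1 / F;     % sampling period in seconds

% Nyquist criterion: F must exceed twice the highest frequency present.
assert(F > 2 * f, 'Sampling rate violates the Nyquist criterion');

n = 0:79;                          % sample indices (10 ms of signal)
s = A * sin(2*pi*f*n*Tc + phi);    % s[n] = s(n*Tc)

% Uniform (linear) quantization to B bits over the range [-A, A].
B      = 8;                        % bits per sample
levels = 2^B;                      % number of quantization levels
step   = 2*A / (levels - 1);       % spacing between adjacent levels
code   = round((s + A) / step);    % integer code in 0 .. levels-1
sq     = code * step - A;          % amplitude recovered from the code

maxErr = max(abs(s - sq));         % cannot exceed step/2
fprintf('step = %.5f, max quantization error = %.5f\n', step, maxErr);
```

Increasing B shrinks the step and therefore the quantization error, at the cost of more bits per sample.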

c] Audio Storage

The number of bits (B) used to represent audio samples plays an important role in storage and transmission. The higher the number of bits, the larger the memory space needed to store a recording and the greater the capacity needed to transmit it through a channel. The number of bits per unit time necessary to represent a signal is called the bit-rate. The area concerned with reducing the bit-rate while preserving the quality of the signal is called audio encoding. The primary encoding methods result in audio formats such as MPEG, WAV, mp3, etc.

Audio Format    Sampling Rate    Bit Rate
MPEG-1          32 to 48 kHz     32 to 384 kbits/s
MPEG-2          16 to 24 kHz     8 to 144 kbits/s
WAV             8 to 48 kHz      4.8 to 176 kbits/s
Mp3             8 to 44 kHz      8 to 320 kbits/s

Table 2: Common audio formats with sampling and bit rate

1B] BLIND SOURCE SEPARATION

Blind source separation (BSS) is the extraction of the individual sound sources that make up a mixture of signals. It has applications in bio-medical signal processing, speech processing, speech recognition, communications, etc. The independent source signals are usually multiplied by different weights (the mixing matrix) and the summed mixture is recorded. The source signals are to be recovered from the available mixture alone, with no other background knowledge. This is often compared to the Cocktail Party Effect, the ability of the human brain to focus on a specific human voice while filtering out other voices or background noise [3,4]. In this text we concentrate on single channel blind source separation (SCBSS).

a] A Brief Review of Blind Source Separation Paradigms

Several methods have been proposed for SCBSS in the literature. In [5], single channel independent component analysis (SCICA) is used to separate the sources. A method based on the wavelet transform along with FastICA is used in [6]. In [7], ensemble empirical mode decomposition (EEMD) is used to decompose a single observation into a number of intrinsic mode functions (IMFs); principal component analysis (PCA) is then used to select the required number of IMFs to be given as input to the ICA algorithm. A learning framework that treats BSS as a generalized eigenvalue problem is applied in [8]. Our project deals with SCBSS employing variational mode decomposition (VMD) and PCA.

b] Variational Mode Decomposition

VMD decomposes a real-valued signal into a finite number of modes concurrently and non-recursively, such that the sum of all the modes reconstructs the original signal. The modes have specific sparsity properties: the center frequency of each mode is determined along with the mode itself, and each mode is band-limited (compact) around its center frequency. The bandwidth of each mode and the reconstruction fidelity are controlled by separate parameters. Details of the VMD implementation can be found in section 2.

c] Principal Component Analysis

PCA identifies the dynamics and behavior of a system from a seemingly complex and incoherent set of data. Even without any prior knowledge of the governing system, we can produce low-dimensional reductions. This procedure is called Principal Component Analysis (PCA), Proper Orthogonal Decomposition (POD) or the Karhunen-Loeve decomposition. PCA can be done by eigenvalue decomposition of the data covariance matrix or by singular value decomposition of the data matrix, usually after mean centering and normalizing the data matrix for each attribute.

Figure 2: Representation of principal component analysis (PCA)

If we imagine a random set of n-dimensional data, PCA can be visualized as fitting an n-dimensional ellipsoid to the data, with each axis of the ellipsoid representing a principal component. When an axis of the ellipsoid is small, the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset we lose only a small amount of information. PCA thus achieves both reduction and identification of the data.
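The ellipsoid picture can be made concrete with a small MATLAB sketch that performs PCA by eigenvalue decomposition of the covariance matrix. The data here are synthetic (points on a rotated ellipse), and the axis lengths, rotation angle and variable names are assumptions made for illustration; only base MATLAB functions are used rather than the toolbox function pca().

```matlab
% Sketch: PCA of correlated 2-D data via the covariance matrix.
% The data are synthetic; sizes and scales are illustrative assumptions.
t     = linspace(0, 2*pi, 200)';          % parameter along an ellipse
Z     = [3*cos(t), 0.3*sin(t)];           % long axis 3, short axis 0.3
theta = pi/6;                             % rotation angle of the ellipse
R     = [cos(theta) -sin(theta); sin(theta) cos(theta)];
X     = Z * R';                           % rotated data, one observation per row

mu = mean(X, 1);                          % mean of each attribute
Xc = X - repmat(mu, size(X,1), 1);        % mean-centered data
C  = (Xc' * Xc) / (size(X,1) - 1);        % covariance matrix

[V, D]       = eig(C);                    % eigenvectors and eigenvalues
[lambda, ix] = sort(diag(D), 'descend');  % order by decreasing variance
V            = V(:, ix);                  % principal axes as columns

explained = 100 * lambda / sum(lambda);   % percent variance per component
scores    = Xc * V;                       % data in principal coordinates

% Keep only the first component: rank-1 approximation of the data.
X1     = scores(:,1) * V(:,1)' + repmat(mu, size(X,1), 1);
relErr = norm(X - X1, 'fro') / norm(Xc, 'fro');
fprintf('explained variance: %.2f%% and %.2f%%\n', explained(1), explained(2));
fprintf('relative error after dropping the short axis: %.4f\n', relErr);
```

The first principal axis lines up with the long axis of the ellipse, and dropping the second component discards only the small variance along the short axis.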

2. PROPOSED METHOD

In this work, a mixture of two sources $x_1(t)$ and $x_2(t)$ is considered. The single channel containing the mixture of these two sources is given by:

$$x(t) = a_1 x_1(t) + a_2 x_2(t) + n(t)$$

Here, $a_1$ and $a_2$ are weighting coefficients and $n(t)$ is any noise that may be introduced in the system. Our objective is to extract $x_1(t)$ and $x_2(t)$.

The proposed method consists of applying VMD to $x(t)$, thus decomposing it into n modes. Each mode belongs to a different spectral band. If the signal $x(t)$ is a mixture of source signals belonging to different spectral bands, they are captured by different modes in VMD. In addition, the out-of-band high frequency noise is captured in other modes. If the noisy modes are rejected, the remaining modes are the extracted source signals.

PCA is used to select the required signal modes. PCA transforms a set of observed correlated vectors into a set of uncorrelated vectors. The transformation is as follows:

$$y = A^{T} x$$

where $A$ is an orthogonal matrix, $x$ is the set of correlated vectors and $y$ is the set of uncorrelated vectors. The elements of $y$ are called the principal components. The variances of the principal components are maximized under the constraint that the principal component vectors are orthogonal to each other. Only the first m principal components (where m is the number of source signals), corresponding to the highest eigenvalues, are selected. The eigenvalues corresponding to the modes containing noise are much smaller than those of the modes containing the source signals. Thus, PCA helps to select the m modes which directly give the source components.

Algorithm for the proposed method
1. Input observed signal $x(t)$
2. Apply VMD on $x(t)$ to get n modes: $M_1, M_2, \ldots, M_n$
3. Apply PCA on $M_1$ to $M_n$ to select m principal components: $P_1, P_2, \ldots, P_m$
4. Extracted sources: $P_1, P_2, \ldots, P_m$

Table 3: Algorithm of the proposed method

2A] SIGNAL CONSTRUCTION

a] Pure Sinusoids

We first use pure sinusoids of different frequencies to test the accuracy of the proposed method. In this implementation $x_1(t) = \sin(3t)$ and $x_2(t) = \sin(7t)$. The figure below shows the individual signals and their mixture.

Figure 3: Pure sinusoids and their mixture

b] Instruments - Cymbal and Bass Drum

The signals for the cymbal and bass drum are shown below. These instruments are chosen because they operate in non-overlapping frequency ranges: cymbals in the range 3 to 5 kHz and bass drums between 60 and 100 Hz.

Figure 4: Sound signal for cymbal, bass drum and their mixture
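Steps 3 and 4 of the algorithm in Table 3 can be sketched as follows. Because VMD() is an external function, this sketch does not run VMD at all: it builds hand-made stand-in modes (the two weighted sinusoids plus a small deterministic term standing in for noise) and shows how the eigenvalues of the mode covariance matrix separate source modes from the noise mode. The weights 0.3 and 0.7, the noise term and all variable names are assumptions for illustration.

```matlab
% Sketch of steps 3-4 of the algorithm, with hand-built stand-ins for the
% VMD modes (assumption: a real run would obtain M from VMD instead).
N  = 2000;
t  = linspace(0, 20, N)';             % time axis (arbitrary units)
x1 = sin(3*t);                        % source 1
x2 = sin(7*t);                        % source 2
a1 = 0.3;  a2 = 0.7;                  % weighting coefficients (illustrative)
nz = 0.01 * sin(40*t + 1);            % deterministic stand-in for noise n(t)
x  = a1*x1 + a2*x2 + nz;              % observed single-channel mixture

M = [a1*x1, a2*x2, nz];               % stand-in modes, one per column (n = 3)
m = 2;                                % number of sources to keep
resid = norm(sum(M, 2) - x);          % the modes sum back to the mixture

Mc = M - repmat(mean(M,1), N, 1);     % mean-center each mode
C  = (Mc' * Mc) / (N - 1);            % covariance between modes
[A, D]       = eig(C);                % columns of A are orthonormal
[lambda, ix] = sort(diag(D), 'descend');
A            = A(:, ix);

Y = Mc * A;                           % principal components (y = A' * x per sample)
P = Y(:, 1:m);                        % the m components with largest eigenvalues

disp(lambda');                        % the smallest eigenvalue belongs to the noise
```

The third eigenvalue is orders of magnitude below the first two, which is the property the method relies on when it keeps only the first m components.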

2B] ALGORITHM IMPLEMENTATION

Variational mode decomposition is applied to the single channel mixed signal $x(t)$ to decompose it into K band-separated modes. For each type of signal, K is set to 2, as the signal is assumed to be a combination of two dominant modes. The function VMD() contributed by [9,10] is used for this purpose. The principal components of the K modes are found with the built-in Matlab function pca(). The signal reconstruction can be obtained using the following equation:

$$\text{PCA reconstruction} = \text{PC scores} \cdot \text{Eigenvectors}^{T} + \text{Mean}$$

Results obtained from the implementation of the above algorithm are summarized in section 3.
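The reconstruction equation can be checked numerically with a short sketch that uses svd() from base MATLAB in place of pca(). The array U below is a hypothetical K-by-N set of modes (one mode per row) made of three sinusoids; it is an assumption for illustration, not output of VMD().

```matlab
% Sketch: PCA reconstruction = scores * eigenvectors' + mean, using svd().
% U is a hypothetical K-by-N array of modes (rows = modes).
N = 500;
t = linspace(0, 10, N);
U = [sin(3*t); sin(7*t); 0.5*cos(11*t)];  % K = 3 illustrative modes
K = size(U, 1);

mu = mean(U, 1);                          % 1-by-N mean over the K rows
Uc = U - repmat(mu, K, 1);                % mean-centered rows
[W, S, V] = svd(Uc, 'econ');              % Uc = W*S*V'
r     = K - 1;                            % centering K rows leaves rank <= K-1
coeff = V(:, 1:r);                        % eigenvectors (N-by-r)
score = W(:, 1:r) * S(1:r, 1:r);          % PC scores (K-by-r)

recon = score * coeff' + repmat(mu, K, 1);
err   = max(abs(recon(:) - U(:)));
fprintf('max reconstruction error: %.2e\n', err);
```

Because all r components are retained, recon reproduces U to machine precision; keeping fewer components would give an approximation instead.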

3. RESULTS

The algorithm discussed in section 2 is implemented with the following results.

a] Separation of Pure Sinusoids

The following figures are obtained for the separation of the mixed pure sinusoidal signals. Data is shown in terms of both amplitude and frequency (spectrogram) versus time for comparison.

Figure 5: Separated and original sinusoidal signal 1.

Figure 6: Spectrogram of separated and original sinusoidal signal 1.

The following metrics were calculated to determine the efficiency of the separation algorithm:

For Signal 1
Correlation coefficient
Percentage error
Root mean square error

Table 4: Metrics for separated and original signal 1

Figure 7: Separated and original sinusoidal signal 2.

Figure 8: Spectrogram of original and separated sinusoidal signal 2.
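The three metrics named in the tables of this section can be computed along the following lines. This is a sketch on made-up signals: the "separated" signal is a hypothetical stand-in, and the percentage error is defined here as the relative norm of the difference, which is an assumption and not necessarily the definition used for the reported values.

```matlab
% Sketch: comparison metrics between an original signal s and an estimate e.
% Both are forced to column vectors so that subtraction is elementwise.
t = linspace(0, 10, 1000)';
s = sin(3*t);                          % original signal (illustrative)
e = 0.95*sin(3*t) + 0.02*cos(7*t);     % hypothetical separated signal

s  = s(:);  e = e(:);
sc = s - mean(s);                      % mean-centered copies
ec = e - mean(e);

R      = (sc' * ec) / sqrt((sc' * sc) * (ec' * ec));  % correlation coefficient
RMSE   = sqrt(mean((s - e).^2));                      % root mean square error
pctErr = 100 * norm(s - e) / norm(s);                 % assumed definition
fprintf('R = %.4f, RMSE = %.4f, error = %.2f%%\n', R, RMSE, pctErr);
```

Forcing both signals to the same orientation matters in MATLAB R2016b and later, where subtracting a row vector from a column vector silently expands to a matrix.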

The following metrics were calculated to determine the efficiency of the separation algorithm:

For Signal 2
Correlation coefficient
Percentage error
Root mean square error

Table 5: Metrics for separated and original signal 2

b] Separation of Cymbal and Bass Drum

The following figures are obtained for the separation of the mixed cymbal and bass drum signal. Data is shown in terms of both amplitude and frequency (spectrogram) versus time for comparison.

Figure 9: Separated and original cymbal

Correlation coefficient 0.6794 Percentage error 43.9026 Root mean square error 0.

Figure 10: Spectrogram of original and separated cymbal

The following metrics were calculated to determine the efficiency of the separation algorithm:

For Cymbal
Correlation coefficient
Percentage error
Root mean square error

Table 6: Metrics for separated and original Cymbal signal

Figure 11: Separated and original bass drum

Figure 12: Spectrogram of original and separated bass drum

The following metrics were calculated to determine the efficiency of the separation algorithm:

For Bass Drum
Correlation coefficient
Percentage error
Root mean square error

Table 7: Metrics for separated and original Bass Drum signal

4. DISCUSSION and CONCLUSION

As the results in section 3 show, pure sinusoids are separated very efficiently by the algorithm. This behavior is expected, as the mixture is created by adding pure sinusoids, each with a single dominant frequency. The separation of the cymbal and bass drum, however, is not very efficient. This can be attributed to the presence of several frequency modes and to noise that may have been introduced during the recording of the cymbal and bass drum. The algorithm thus performs the separation reasonably well but has scope for improvement.

REFERENCES

[1] F. Camastra, A. Vinciarelli, Machine Learning for Audio, Image and Video Analysis, Advanced Information and Knowledge Processing, Springer-Verlag London 2015, Accessed: 5/22/2018 4:00 PM.
[2] Z-N. Li, M. Drew, J. Liu, Fundamentals of Multimedia, Prentice Hall
[3] P. Dey, U. Satija, B. Ramakumar, Single Channel Blind Source Separation Based on Variational Mode Decomposition and PCA, IEEE INDICON 2015
[4] Deep Learning Machine Solves the Cocktail Party Problem, Online: Accessed: 5/25/2018 3:28 PM
[5] M. E. Davies and C. J. James, Source separation using single channel ICA, Signal Process
[6] M-G Lopez, H. Lozano, L.P. Sanchez and L.N.O. Moreno, Blind Source Separation of audio signals using independent component analysis and wavelets, Int. Conf. Electr. Commun. Comput.,
[7] Y. Guo, S. Huang, Y. Li, Single-Mixture source separation using dimensionality reduction of ensemble empirical mode decomposition and independent component analysis, Circuits Syst. Signal Process.,
[8] H. Liu, Y. Cheung, A learning framework for blind source separation using generalized eigenvalues, Lecture Notes in Comput. Sci. Vol.,
[9] D. Zosso, Variational Mode Decomposition, Online: Accessed: 5/26/2018 5:18 PM
[10] K. Dragomiretskiy, D. Zosso, Variational Mode Decomposition, IEEE Trans. Signal Processing, 2014

APPENDIX

% ME 535 Project
% Analysis of Audio using PCA
clear all; close all; clc

t  = linspace(0, 6000, 100);
fs = 1;                       % 44100 for the audio clips
s1 = sin(3*t)';               % trying to separate this and s2
%[s,fs] = audioread('cymbal_recording_clip.mp3');
%s_2 = s(:,1);
s2 = sin(7*t)';               % s_2(3000:3999)';
mix = s1' + s2';              % 0.3*s1' + 0.7*s2';

figure(1)
subplot(3,1,1)
plot(s1,'-b'); title('signal 1: x_1 (t)')
subplot(3,1,2)
plot(s2,'-b'); title('signal 2: x_2 (t)')
ylabel('amplitude');
subplot(3,1,3)
plot(mix,'-b');
title('source Mixture: x(t) = x_1 (t) + x_2 (t)');
xlabel('time (t)');

%% Implementation of VMD
alpha = 2000;   % moderate bandwidth constraint
tau   = 0;      % noise-tolerance (no strict fidelity enforcement)
K     = 2;      % K modes
DC    = 0;      % no DC part imposed
init  = 1;      % initialize omegas uniformly
tol   = 1e-7;

[u, u_hat, omega] = VMD(mix, alpha, tau, K, DC, init, tol);
% u     - the collection of decomposed modes (one mode per row)
% u_hat - spectra of the modes (complex, one mode per column)
% omega - estimated mode center-frequencies

figure(2)
plot(u'); title('u')                  % transpose so each mode is one line
figure(3)
plot(abs(u_hat)); title('u-hat')      % magnitude of the complex spectra
figure(4)
plot(omega); title('omega')

%% Implementation of PCA and Signal Reconstruction
close all;
[coeff, score, latent, tsquared, explained, mu] = pca(u);
%coeff = pca(u')
recon = score*coeff' + repmat(mu, K, 1);   % scores * eigenvectors' + mean

figure(5)
plot(s1,'-b', 'Linewidth',1.1)
hold on;
plot(recon(1,:),'--r', 'Linewidth',1.5);
xlabel('time (t)'); ylabel('amplitude');
title('signal 1')
legend('original', 'Separated')

figure(6)
plot(s2,'-b', 'Linewidth',1.1)
hold on;
plot(recon(2,:),'--r', 'Linewidth',1.5);
xlabel('time (t)'); ylabel('amplitude');
title('signal 2')
legend('original', 'Separated')

% Window length for all spectrograms: 4 suits the 100-sample sinusoids,
% 1024 suits the audio clips. The window must not exceed the signal length.
nwin = 4;

%% Spectrogram of Original Signal (s1)
figure(7)
spectrogram(s1, nwin, 3/4*nwin, [], fs, 'yaxis')
box on
set(gca, 'FontName', 'Times New Roman', 'FontSize', 14)
xlabel('time, s')
ylabel('frequency, Hz')
title('spectrogram of sin(3t)-original')
h = colorbar;
set(h, 'FontName', 'Times New Roman', 'FontSize', 14)
ylabel(h, 'Magnitude, dB')

%% Spectrogram of Reconstructed Signal (s1)
figure(8)
spectrogram(recon(1,:), nwin, 3/4*nwin, [], fs, 'yaxis')
box on
set(gca, 'FontName', 'Times New Roman', 'FontSize', 14)
xlabel('time, s')
ylabel('frequency, Hz')
title('spectrogram of sin(3t)-unmixed')
h = colorbar;
set(h, 'FontName', 'Times New Roman', 'FontSize', 14)
ylabel(h, 'Magnitude, dB')

%% Spectrogram of Original Signal (s2)
figure(9)
spectrogram(s2, nwin, 3/4*nwin, [], fs, 'yaxis')
box on
set(gca, 'FontName', 'Times New Roman', 'FontSize', 14)
xlabel('time, s')
ylabel('frequency, Hz')
title('spectrogram of sin(7t)-original')
h = colorbar;
set(h, 'FontName', 'Times New Roman', 'FontSize', 14)
ylabel(h, 'Magnitude, dB')

%% Spectrogram of Unmixed Signal (s2)
figure(10)
spectrogram(recon(2,:), nwin, 3/4*nwin, [], fs, 'yaxis')
box on
set(gca, 'FontName', 'Times New Roman', 'FontSize', 14)
xlabel('time, s')
ylabel('frequency, Hz')
title('spectrogram of sin(7t)-unmixed')
h = colorbar;
set(h, 'FontName', 'Times New Roman', 'FontSize', 14)
ylabel(h, 'Magnitude, dB')

%% Metrics for comparison
r1 = recon(1,:)';             % column vectors, matching s1 and s2
r2 = recon(2,:)';
R_1 = corrcoef(s1, r1)
R_2 = corrcoef(s2, r2)
Norm_s1 = norm(s1 - r1);
per_err = 100*(spectrogram(s1) - spectrogram(r1))./spectrogram(s1);
per_err_abs = abs(per_err);
per_err_vector = per_err_abs(:)
% RMS error, signal
RMSE = sqrt(mean((s1 - r1).^2))
% RMS error, spectrogram
RMSE_spec = mean(abs(sqrt(mean((spectrogram(s1) - spectrogram(r1)).^2))))
%mean_percentage_error = mean(per_err_vector)

Published with MATLAB R2017b
