AutoScore: The Automated Music Transcriber
Project Proposal
18-551, Spring 2011, Group 1
Suyog Sonwalkar, Itthi Chatnuntawech
ssonwalk@andrew.cmu.edu, ichatnun@andrew.cmu.edu
May 1, 2011

Abstract

This project develops an automatic music transcription system for a single instrument over its entire chromatic range. We train a transcription system for a keyboard using the non-negative matrix factorization method of [3]. Preliminary testing was performed in MATLAB; the system was then reimplemented on the TI TMS320C6713B Digital Signal Processor (DSP). The final implementation runs in real time, primarily on the DSP, with a Graphical User Interface (GUI) on a Macintosh computer. The DSP and Mac are connected through a networking interface that transfers note data to the Mac in real time.

Problem

Music transcription is the process of converting raw music signals into a musical score. Automated transcription can help musicians create sheet music and can serve as an educational tool for amateurs. Manually transcribing music requires significant skill and time from musicians. Transcription is also difficult for computers, because modern music contains multiple instruments playing multiple notes simultaneously (polyphony). Many methods have been developed to transcribe music from a single instrument, including Bayesian methods [1] and even genetic algorithms [2]. Our project implements a recently proposed method that uses non-negative matrix factorization to perform real-time music transcription.

Solution

Our solution uses the recently developed method for real-time music transcription described in [3]. We use a Casio CTK-591 keyboard to train and test the transcription system. The system block diagram is shown in Figure 1. The system consists of multiple parts.
First, the system was trained on musical note samples from the keyboard. This was performed off-line: it was completed before any testing and was not part of the real-time system. Training the note templates consisted of obtaining the short-time Fourier Transform

Figure 1: Block diagram of our music transcription system [3].

(STFT) of each of the musical note inputs, then performing non-negative matrix factorization (NMF) on the spectrogram representation obtained from the STFT. The NMF produced a note template $w^{(k)}$ for the $k$-th music sample. This process was performed for each note on the keyboard, and the resulting $w^{(k)}$ vectors were stacked into a matrix $W$. This completed the training phase of our transcription system. The training phase was implemented on a Mac in MATLAB in order to speed up the training process.

The testing phase of our system runs in real time, with most of the work done on the TI TMS320C6713B DSP. The DSP obtains new musical input data at short time intervals and computes the Fourier Transform (FT) of those signals. We denote the magnitude of the FT of the $j$-th window by $v_j$. For each $v_j$, the DSP performs a correlation against the template dictionary $W$ trained in the previous step to compute the note activations $h_j$:

$$h_j = W^T v_j$$

These activations determine whether or not a specific note is being played. Note that as pre-processing we max-normalize the $w^{(k)}$ of each training sample, and as post-processing we filter and threshold the activations $h_j$. In the next section, we cover the mathematical background of our system.
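The training and correlation steps above were implemented in MATLAB (training) and in C on the DSP (testing). The following NumPy sketch illustrates the same pipeline; the function names, the non-overlapping framing, and the fixed random seed are our own illustrative choices, not the project's actual code. The rank-1 factorization uses the Lee-Seung Euclidean multiplicative updates given in the Background section.

```python
import numpy as np

def spectrogram(x, win_len=4096):
    """Magnitude spectrogram from non-overlapping Hamming-windowed frames."""
    w = np.hamming(win_len)
    n_frames = len(x) // win_len
    frames = x[:n_frames * win_len].reshape(n_frames, win_len) * w
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq bins, frames)

def rank1_nmf(V, n_iter=200, eps=1e-9):
    """Rank-1 NMF via the Lee-Seung Euclidean multiplicative updates:
    approximate the non-negative matrix V (freq x frames) as W @ H."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, 1)) + eps
    H = rng.random((1, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W[:, 0]

def train_templates(note_signals):
    """One max-normalized rank-1 template per recorded note; the templates
    are stacked column-wise into the dictionary W."""
    cols = []
    for x in note_signals:
        wk = rank1_nmf(spectrogram(x))
        cols.append(wk / wk.max())        # max-normalization pre-processing
    return np.column_stack(cols)          # W: (freq bins, num notes)

def activations(W, v):
    """Correlation step of the testing phase: h_j = W^T v_j."""
    return W.T @ v
```

With two synthetic "notes" (sinusoids at 440 Hz and 880 Hz), the trained dictionary correctly identifies which note a test frame contains via the largest activation.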

Background

Short-Time Fourier Transform (STFT)

The short-time Fourier Transform, or STFT, is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time [4]. In the discrete-time STFT, the signal is broken into chunks by a window function $w[n]$, and each chunk is then Fourier transformed:

$$\mathrm{STFT}\{x[n]\} = X(m, \omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-j\omega n}$$

Here $m$ represents the shift of the window in time, while $\omega$ represents frequency. The spectrogram is the squared magnitude of the STFT [4]:

$$\mathrm{Spectrogram}\{x[n]\} = |X(m, \omega)|^2$$

An example of a short-time Fourier Transform can be seen in Figure 2. The window function used when computing the STFT was a Hamming window, defined by:

$$w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N}\right), \quad 0 \le n \le N$$

In our implementation, the window length was 4096. An example of the Hamming window function can be seen in Figure 3.

Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) factorizes an $n \times m$ non-negative matrix $V$ into an $n \times r$ non-negative matrix $W$ and an $r \times m$ non-negative matrix $H$, where $r$ is a positive integer less than $n$ and $m$ [3,6]; $r$ is called the rank of the factorization. This produces an approximation of $V$ such that:

$$V \approx WH$$

Solving for the NMF of a matrix requires a goodness-of-fit measure, called the cost function. The standard cost function uses Euclidean distance, which makes NMF the problem of minimizing:

$$\frac{1}{2}\|V - WH\|^2$$

Methods for solving this problem have been extensively studied. To compute the $W$ and $H$ matrices, the iterative multiplicative updates algorithm introduced in [5] is used; in [8], Lee and Seung provide proofs of why the algorithm works. The updates for the Euclidean distance metric are as follows:

Figure 2: Short-Time Fourier Transform example. The x-axis represents time (seconds); the y-axis represents frequency (Hertz).

Figure 3: 4096-point Hamming window

$$W \leftarrow W \circ \frac{V H^T}{W H H^T}, \qquad H \leftarrow H \circ \frac{W^T V}{W^T W H}$$

where $\circ$ is element-wise multiplication of matrices and the division is element-wise division. The rank used in our implementation is $r = 1$, because we use a single template vector for each training sample. These vectors are later stacked into the matrix dictionary $W$.

Correlation Method

In the real-time testing phase, it is necessary to compare the magnitude of the Fourier representation $v_j$ against the templates. We use correlation because it is extremely efficient and simple to implement on the DSP:

$$h_j = W^T v_j$$

Alternative methods for performing a similar computation would use a distance metric to determine the correspondence between the template vectors in $W$ and the magnitude of the Fourier representation $v_j$. These methods are described in the future work section.

What We Implemented

Database

For our project, we created our own database of training samples from the Casio CTK-591 keyboard: 61 musical note samples, one for each key. These were used in the training phase of our solution.

Testing

We test our system by performing error calculations on musical samples played on the keyboard, comparing our transcriptions against the notes actually played. The total error is a combination of the substitution error $\epsilon_{subs}$, the missed error $\epsilon_{miss}$, and the timing error $\epsilon_{time}$. The substitution error occurs when the transcription classifies a note as another note, including octave errors. The missed error occurs when the transcription does not report any note while a note is actually playing. The timing error occurs when the transcription misses small timing details.
An example of a timing error is when one note is played multiple times in a short interval but is classified as being played only once.
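The substitution and missed errors above can be computed by comparing per-window note labels. The sketch below is a hedged illustration of that bookkeeping; the per-window label representation (with None for silence) is our own assumption, and timing errors are omitted since they depend on grouping windows into note events rather than on per-window labels.

```python
def error_rates(reference, transcribed):
    """Compare per-window reference and transcribed note labels (None = silence).

    Returns (substitution, missed, total) error fractions over the windows
    in which a note is actually playing, as in the Testing section.
    """
    assert len(reference) == len(transcribed)
    total = sum(1 for r in reference if r is not None)
    subs = sum(1 for r, t in zip(reference, transcribed)
               if r is not None and t is not None and t != r)
    miss = sum(1 for r, t in zip(reference, transcribed)
               if r is not None and t is None)
    return subs / total, miss / total, (subs + miss) / total
```

For example, one substituted window and one missed window out of five note-playing windows give errors of 0.2, 0.2, and 0.4 respectively.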

Hardware

We used a Mac with MATLAB installed to train the musical template dictionary. To display the output of our transcription system, we created a GUI on Mac OS X, shown in Figure 4. The real-time transcription calculations were performed on the TI TMS320C6713B DSP. The Mac and the DSP communicated over a TCP sockets networking interface.

DSK Implementation

The real-time algorithm on the DSP samples the keyboard's line input at 44.1 kHz. The Fourier Transform is computed every 4096 samples (about 0.1 seconds). The magnitude of the FT is then correlated with the template matrix $W$, which is sent dynamically from the Mac after a network connection is established. The output of the correlation is returned to the Mac over the same connection.

Real-time Speed Issues

The DSP code performed its calculations in real time and sent data to the Mac periodically with little lag. However, the DSP code had significantly more timing errors than the MATLAB-tested code. These issues could be addressed in future work by performing the calculations on overlapping (interleaved) windows.

Demo

A live demo of our system transcribing notes from the keyboard was given on April 26, 2011; an example is shown in Figure 5. Notes were input to the DSP through the line in, transcribed, and displayed on a virtual keyboard on the Mac. We also let others test the system by playing their own notes. In addition, when the tone of the keyboard was changed (for example, to a trumpet tone), the algorithm still produced an acceptable transcription even though the system was not trained on that tone.
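The per-frame DSK loop described above (4096-sample frames at 44.1 kHz, FFT magnitude, correlation against $W$, thresholding) can be sketched as follows. This is an illustrative Python reconstruction, not the actual C6713 C code: the normalization and the threshold value are our assumptions, the real implementation sends results over TCP rather than yielding them, and, like the current DSP code, no window or overlap is applied to the frames.

```python
import numpy as np

FRAME = 4096  # samples per FFT at 44.1 kHz (~0.09 s per frame)

def process_stream(samples, W, threshold=0.5):
    """Per-frame processing mirroring the DSK loop: FFT magnitude of each
    4096-sample frame, correlation h = W^T v, then thresholding.
    Yields the set of active note indices for each frame."""
    n_frames = len(samples) // FRAME
    for j in range(n_frames):
        frame = samples[j * FRAME:(j + 1) * FRAME]
        v = np.abs(np.fft.rfft(frame))     # magnitude of the FT
        h = W.T @ v                        # correlation against templates
        h = h / (h.max() + 1e-12)          # normalize before thresholding
        yield {k for k, a in enumerate(h) if a > threshold}
```

With a dictionary whose two columns peak at the FFT bins nearest 440 Hz and 880 Hz, a 440 Hz input frame activates only the first note.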

Figure 4: Mac OS X GUI (top), with notes playing (bottom)

Figure 5: Example demo. Keyboard (left), Mac GUI (right); the DSK is in the background.

Results

Figure 6 displays the template dictionary matrix $W$ trained on the individual notes of the keyboard; the training was performed in MATLAB on a Mac. The x-axis indexes the $k$-th note of the keyboard (out of 61), while the y-axis displays the template vector $w^{(k)}$ for the corresponding note.

MATLAB Testing Results

We tested the system in MATLAB on a sample performance of "Mary Had a Little Lamb" in C major. The results are shown in Figure 7. The x-axis indexes the $j$-th time window, while the y-axis represents the 61 notes. The song was sampled at 44.1 kHz with a time window of 4096 samples. The red bars in the figure represent notes that were activated in a given time frame. The sample song is provided on the accompanying CD.

MATLAB Error Rates

The error rates for the "Mary Had a Little Lamb" test are given below; each error type is explained in the Testing section above.

Error               Fraction   Rate
Timing Error        2/25       0.08
Substitution Error  0/25       0.00
Missing Error       1/25       0.04
Total Error         3/25       0.12
Success Rate        22/25      0.88

Figure 6: Template dictionary matrix W

Figure 7: Note activations for "Mary Had a Little Lamb" (MATLAB test)

DSP Error Rates

The DSP error rate was calculated over the entire chromatic scale of 61 notes. Timing errors were not included in this calculation, as there were significant timing errors with the DSP implementation; an improvement is discussed in the future work section.

Error               Fraction   Rate
Substitution Error  8/61       0.13
Missing Error       8/61       0.13
Total Error         16/61      0.26
Success Rate        45/61      0.74

Timeline

Date                     Tasks                                            Responsibility
Week 6-8 (2/14-3/6)      Obtained the training data set                   Suyog
                         (keyboard note samples)
                         Started MATLAB training code                     Itthi
                         (compute STFT and NMF on training samples)
Week 9 (3/7-3/13)        Finished MATLAB training code                    Itthi
                         Started implementing DSP code                    Suyog
Week 10 (3/14-3/20)      Implemented Mac code and networking              Suyog
Week 11 (3/21-3/27)      Finished DSP code                                Itthi and Suyog
Week 12 (3/28-4/3)       Combined the systems and finished coding         Itthi and Suyog
                         (combine Mac and DSP code)
Week 13 (4/4-4/10)       Tested on synthetic data                         Itthi and Suyog
Week 14 (4/11-4/17)      Reimplemented MATLAB code                        Itthi
                         Retrained training notes                         Itthi and Suyog
                         Finished GUI                                     Suyog
Week 15 (4/18-4/24)      Optimized code and system                        Suyog
                         Evaluated on test data                           Itthi
                         Cleaned up and commented code                    Itthi and Suyog

Previous Work in 18-551

Previous projects in the course performed limited transcription: they either avoided stringed instruments (such as G8-S05) or only detected single tones in a limited octave range (such as G9-S00).

Novelty

Our project performs transcription on a keyboard in real time with notes played throughout its full range. In addition, we use the DSP for most of the real-time calculations; in comparison, the authors of [3] implemented their real-time solution in MATLAB on a 2.4 GHz PC.

Discussion & Future Work

Improvement of Accuracy

For future work, we would like to improve accuracy on the lower octaves by spreading out the template matrix. This can be achieved by downsampling, which spreads out the frequency content in the Fourier domain. In addition, we can improve the resolution of the template vectors $w^{(k)}$ after downsampling by zero-padding in time [7].

Alternative to Correlation & Polyphonic Music

We can also modify our algorithm to determine the activations $h_j$ with a non-negative matrix decomposition method [3], rather than the correlation currently used. This is done in [3] with an NMF-like scheme that solves for the activations $h_j$ given a fixed $W$:

$$v_j \approx W h_j$$

In [3], the beta-divergence is used as the cost function for solving for $h_j$:

$$d_\beta(x \| y) = \frac{1}{\beta(\beta - 1)}\left(x^\beta + (\beta - 1)\, y^\beta - \beta x y^{\beta - 1}\right)$$

This cost function, used in [3] for computing the activations $h_j$, leads to the following multiplicative update:

$$h \leftarrow h \circ \frac{W^T\!\left(v \circ (Wh)^{.\beta - 2}\right)}{W^T (Wh)^{.\beta - 1}}$$

where $\circ$ is element-wise multiplication and $.^\beta$ denotes element-wise powers. Using this $h_j$ could improve our accuracy during testing. Moreover, [3] shows that this approach also works with polyphonic music, which would allow our implementation to handle multiple notes at the same time.
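A minimal sketch of this alternative, assuming a fixed dictionary W and a single magnitude-spectrum frame v; the function name, the default β = 1 (the KL divergence; β = 2 gives the Euclidean update), and the iteration count are our own illustrative choices:

```python
import numpy as np

def solve_activations(W, v, beta=1.0, n_iter=500, eps=1e-9):
    """Estimate h with fixed W so that v ~= W h, using the multiplicative
    update for the beta-divergence cost (the alternative to correlation
    discussed above)."""
    n, r = W.shape
    h = np.full(r, 1.0 / r)                      # positive initialization
    for _ in range(n_iter):
        Wh = W @ h + eps
        num = W.T @ (v * Wh ** (beta - 2))       # element-wise power and product
        den = W.T @ (Wh ** (beta - 1)) + eps
        h *= num / den
    return h
```

Since the update is multiplicative, h stays non-negative throughout, and for an exactly representable v the reconstruction W h converges toward v.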

Improve DSP Implementation

Our DSP implementation does not currently use overlapping or Hamming-windowed frames (as our MATLAB implementation does). We could potentially improve the accuracy of our DSK implementation by applying 50% overlapping Hamming windows to the input signal.

References

[1] Peeling, Paul H., "Probabilistic Modelling and Bayesian Inference Techniques for Music Transcription," University of Cambridge, 2007.
[2] Reis, G.; Fonseca, N.; Ferndandez, F., "Genetic Algorithm Approach to Polyphonic Music Transcription," IEEE International Symposium on Intelligent Signal Processing (WISP 2007), pp. 1-6, 3-5 Oct. 2007.
[3] Dessein, A.; Cont, A.; Lemaitre, G., "Real-time Polyphonic Music Transcription with Non-Negative Matrix Factorization and Beta-Divergence," International Society for Music Information Retrieval Conference, 2010.
[4] "Short-time Fourier Transform," Wikipedia, the Free Encyclopedia. Web. 12 Feb. 2011. http://en.wikipedia.org/wiki/Short-time_Fourier_transform.
[5] Lee, D.; Seung, S., "Learning the parts of objects by non-negative matrix factorization," Nature, 1999.
[6] Berry, M.; Browne, M.; Langville, A.; Pauca, V.; Plemmons, R., "Algorithms and Applications for Approximate Nonnegative Matrix Factorization," Elsevier Preprint, 2006.
[7] Oppenheim, A.V.; Schafer, R.W.; Yoder, M.T.; Padgett, W.T., Discrete-Time Signal Processing, Prentice Hall, 2009.
[8] Lee, D.; Seung, S., "Algorithms for non-negative matrix factorization," Advances in Neural Information Processing Systems, April 2001, pp. 556-562.