Automatic Guitar Chord Recognition


Registration number
Automatic Guitar Chord Recognition
Supervised by Professor Stephen Cox
University of East Anglia
Faculty of Science, School of Computing Sciences

Abstract

Chord recognition is a well-explored area of music information retrieval, but guitar chord recognition remains relatively unexplored. The purpose of this project was to develop a system that could recognise chords played in real time on a guitar, and to use this as the basis of a software package that could provide musicians with a teaching or transcription tool. Initial research was carried out on single note and two note chord recognition, which achieved accuracies of 74% and 90% respectively. The final system, for recognising chords containing an arbitrary number of notes, used an implementation of Pitch Class Profiles. A comparison was made between a learnt Bayesian approach and a non-learnt Nearest Neighbour approach. For common chords, the learnt approach produced an accuracy of 99%, compared to 94% for the non-learnt approach. When extended to complex chords, the learnt approach produced an accuracy of 84%, against 80% for the non-learnt approach. A real-time demonstration was then produced in MATLAB.

Acknowledgements

I would like to thank Stephen Cox for his support and guidance throughout this project. Thank you to Mum and Hattie for proof-reading despite not understanding, and a further thank you to Hattie for helping me through the tough stages of work.

Contents

1. Introduction
   Background and Motivations
   Aims
   Objectives
2. Literature Review
   Motivations
   Background Music Theory
      General Music Theory
      Guitar Characteristics
   Approaches
      Early Work
      Pitch Class Profiles
      Template Approaches
      Probabilistic Approaches
3. Technical Background
   Fourier Transform
      Spectral Leakage
      Frequency Resolution
      Fast Fourier Transform
   Distance Measures
      Euclidean
      Hamming
      Cosine
   Naive Bayes
      Multinomial Model
      Multivariate Bernoulli Model
   Nearest Neighbour
   Musical Terms

4. Development of the System
   Data Collection
   Segmenting
   Obtaining the Frequency Spectrum
   Peak Picking
   Single Note Recognition
   Two Note Chords
      Revised Peak Picker
      Feature Extraction
      Classification
   Pitch Class Profiles
      Feature Extraction
      Template Classification
      Probabilistic Classification
5. Results
   Single Note Recognition
   Two Note Recognition
   Pitch Class Profiles
      Nearest Neighbour
      Naive Bayes
   Complex Four and Five Note Chords
   Chords on Acoustic Guitar
6. Development of a Real-Time Application
   System Structure
   User Interface
7. Discussion and Future Work
   Discussion
   Future Work
8. Summary

References
A. Single Note and Two Note Recognition Confusion Matrices
B. Nearest Neighbour Confusion Matrix
C. Naive Bayes Confusion Matrix
D. Nearest Neighbour and Naive Bayes for Acoustic Chords

1. Introduction

1.1. Background and Motivations

Current research into chord recognition from audio input focuses on extracting a sequence of chords from an audio file stored on a computer, often a piece of popular music. There is currently very little research on extracting the identity of individual chords, despite the potential of this to aid musicians. Beginner guitarists often learn by simply following guitar tablature, and are unaware of the characteristics of the chords they are playing and how these fit in with the music theory behind the instrument. It is important for a musician to understand the chord they are playing, and a tool to aid recognition could be very beneficial. This project aims to develop a system that focuses on extracting single chords in real time from an audio signal played by a guitar. The research will compare the effectiveness of template approaches with probabilistic approaches, before applying them to a real-time system.

1.2. Aims

The aim of this project is to develop a system that can label chords played in real time on an (electric) guitar. Through the use of signal processing techniques, a system will be developed that can label chords containing any number of notes in real time. The system will be extensively tested across a wide range of chord shapes, as well as on both electric and acoustic guitars. A real-time demonstration will then be developed which could serve as a tool for beginner musicians, or even be extended into a transcription tool.

1.3. Objectives

At the completion of the project, it was hoped that the following key objectives would be met:

Explore and evaluate the effectiveness of various template and probabilistic approaches.

Develop a system capable of detecting common guitar chords using both a template and a probabilistic approach, with an accuracy of over 90%.

Extend the system to recognise more complex four and five note chords.

2. Literature Review

2.1. Motivations

Automatic guitar chord recognition can be regarded as an aspect of music information retrieval (MIR). In order to begin this project, it is important to be aware of the research which has previously been conducted in the field of MIR. Despite much research into the structure of music over the last century, significant strides in MIR have only been made recently. Because music is now commonly stored as files on computers connected by the internet, MIR has become both more achievable and more useful. One particularly interesting area of music processing is chord recognition. Research in this area has focused on the labelling of continuous chords in music recordings using template and probabilistic approaches. As the core research in this project involves a comparison of the effectiveness of template and probabilistic approaches, it is important to investigate the research that has been carried out on these techniques. Therefore this literature review focuses on some template and probabilistic techniques, as well as other areas of knowledge required for the project, including background music theory.

2.2. Background Music Theory

General Music Theory

To be able to implement a guitar chord recogniser, some background knowledge of music theory is required. Firstly, a chord is formed from three or more notes, with no maximum (although this is limited by the number of strings or fingers available). The most basic form of chord is a triad, which, as its name suggests, is made up of three notes. The first note of this chord is called the root, and another two notes are added above it.

In a triad, the added notes are always the third and the fifth. For instance, in a C Major triad, the base note is C and the added notes are E and G. Chords can be classified as Major, Minor, Augmented or Diminished (Taylor, 1989). Chords can be extended further through inversions. A chord's inversion describes the relationship between its bass and the other tones in the chord. In an inverted chord, the root is not the bass. Inversions are numbered in the order their bass tones would appear in a closed root position chord. For example, in the first inversion of a C Major triad, the bass is an E, the 3rd of the triad, with the 5th and root stacked above it. In the second inversion of a C Major, the bass is now G, the 5th of the triad, with the root and 3rd above it. Triads can also be altered to be either diminished or augmented. A minor triad becomes diminished by lowering the 5th scale degree by half a step; for a Major triad to become diminished, both the 3rd and 5th scale degrees are lowered by half a step. To augment a Major chord, the 5th note is raised by a semitone (Lerdahl and Jackendoff, 1985).

Guitar Characteristics

The guitar as an instrument has many specific characteristics in the sound it produces. The fundamental frequencies of notes have overtones, which are frequency components at higher multiples of the fundamental frequency. The method of picking the string will either exaggerate or diminish the strength of these overtones, which helps to produce the guitar's unique timbre. The guitar I am using for this project is an electric guitar, so the sound is detected through magnetic coils inside pickups. As a string is strummed, it vibrates and disturbs the magnetic field around it, which is detected by the pickups. The vibrating length of the string determines the pitch of the note being played, which explains why the further down the fretboard you play, the higher the pitch of the note. This sound is then further amplified through an amplifier. Some recordings were also made on a hollow-body acoustic guitar. The tone of an acoustic guitar is heavily reliant on its shape and material, and the vibration of the strings is amplified by the wooden body (Alexander, 2014).

2.3. Approaches

Early Work

Traditionally, the task of chord recognition was treated as polyphonic transcription to identify individual notes. This approach was fundamentally flawed, as it suffered from recognition errors caused by noise and by overlapping harmonics in the spectrum of the input signal. The first alternative approach was proposed by Leman, who developed the Simple Auditory Model (SAM). This was the first approach to use an intensity map of the twelve semitone pitch classes, calculated from a spectrum, and it therefore achieved more robustness than the note names used previously (Chafe, 1986).

Pitch Class Profiles

Pitch Class Profiles (PCPs), proposed by Fujishima (1999), used SAM as a framework to produce the first robust method of chord detection. PCPs, also referred to as chroma vectors, are twelve-dimensional vectors that represent the intensities of the twelve members of the chromatic scale without regard to their octave (Fujishima, 1999). Fujishima's algorithm works by first taking the Discrete Fourier Transform (DFT) of a fragment of the input signal, then mapping its values onto spectral bins corresponding to the twelve semitones of the chromatic scale. A vector of twelve intensities is produced for each frame; these are then summed to produce the PCP of the whole note. Chords are then classified using either the Nearest Neighbour method or the weighted sum method. Fujishima also applied several heuristics to the PCP vectors, including smoothing over past PCPs using an averaging operation to reduce noise. He found that smoothing did reduce noise; however, it created over-smoothness, blurring the chord change points. He also applied chord change sensing by monitoring the change in direction of the vector, and this helped to preserve the chord change points. When using synthesised sounds, Fujishima's algorithm produced very high accuracy using both Nearest Neighbour and weighted sum pattern matching. However, when used with real musical recordings, the accuracy was too low for the results to be meaningful (Fujishima, 1999). Due to these shortcomings, extensions to PCPs have been proposed.
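As an illustration of the idea, a Pitch Class Profile in the spirit of Fujishima's chroma vectors might be computed as in the following minimal Python/NumPy sketch. This is not the project's MATLAB implementation; the reference frequency, the use of squared magnitudes as intensities, and the normalisation are illustrative choices.

```python
import numpy as np

def pitch_class_profile(signal, sample_rate, f_ref=261.63):
    """Map spectral energy onto the twelve pitch classes (a chroma vector).

    f_ref is the reference frequency for pitch class 0 (here C4) --
    an illustrative choice, not necessarily Fujishima's.
    """
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    pcp = np.zeros(12)
    for f, mag in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        # Distance in semitones from the reference, folded onto 12 classes.
        pitch_class = int(round(12 * np.log2(f / f_ref))) % 12
        pcp[pitch_class] += mag ** 2  # accumulate intensity per semitone
    return pcp / (np.linalg.norm(pcp) or 1.0)  # normalise to unit length
```

For a pure 440 Hz tone (the note A), the largest component of the resulting vector falls in pitch class 9, nine semitones above C.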

A chromagram is a collection of PCPs over time. Despite the popularity of chromagrams, they are not without their flaws. Chroma extraction algorithms often represent the chroma vector using binary values at each bin. This approach does not work for real-world recordings, as acoustic instruments produce overtones as well as fundamental notes. This means that regardless of whether the fundamental notes are extracted correctly, there will be non-zero intensities at all twelve points on the chroma scale. These noisy chroma vectors can cause problems later in the recognition process (Lee, 2006). Lee (2006) proposed an extension to PCPs called the Enhanced Pitch Class Profile (EPCP). EPCPs aim to enhance traditional chroma vectors by making them more similar to their binary form, like the templates used in pattern matching. EPCPs are calculated from the Harmonic Product Spectrum (HPS), which is derived from the DFT of the input signal. The guitar produces a sound that has harmonics at integer multiples of its fundamental frequency, so decimating the original magnitude spectrum by an integer factor will also yield a peak at the fundamental frequency. The HPS is calculated by multiplying these decimated copies of the magnitude spectrum together, and the peak in the HPS lies at the fundamental frequency. The chroma vectors are then calculated from the HPS instead of the DFT (Lee, 2006). Cremer (2004) proposed an alternative, which is to derive the chromas from a frequency-warped Fast Fourier Transform (FFT), followed by the erasing of overtones and the separation of tonal components from transients. Gómez (2006) took a different approach, finding the peaks in the spectrum using local maxima and estimating the peak magnitudes by quadratic interpolation. The chroma vector can then be calculated by weighting each peak by its contribution to each chroma vector bin (Stark and Plumbley, 2009).
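The core HPS step that the EPCP rests on might be sketched as follows (illustrative Python/NumPy; the number of harmonics to include is an assumed parameter, not a value taken from Lee's paper):

```python
import numpy as np

def harmonic_product_spectrum(signal, num_harmonics=4):
    """Harmonic Product Spectrum: multiply the magnitude spectrum by
    copies of itself decimated by 2, 3, ..., num_harmonics.

    Harmonics at integer multiples of the fundamental line up after
    decimation, so the product peaks at the fundamental frequency bin.
    """
    spectrum = np.abs(np.fft.rfft(signal))
    hps = spectrum.copy()
    for h in range(2, num_harmonics + 1):
        decimated = spectrum[::h]          # keep every h-th bin
        hps[:len(decimated)] *= decimated  # product aligns the harmonics
    return hps
```

For a signal with components at 110 Hz and its harmonics, the product suppresses the harmonic peaks (whose own multiples carry little energy) and leaves the fundamental dominant.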
The use of an EPCP was found to outperform the conventional PCP vector, both at the original frame rate and when the signal was smoothed. The difference in performance becomes more apparent when there is a greater degree of confusion between harmonically related chords, as EPCPs are much less sensitive to this confusion. Other extensions to traditional PCPs have been proposed, including one by Oudre et al. (2009), which states that the twelve-dimensional chroma vector should be made up of amplitudes present in the chord that are larger than those of the non-played chromas. By

introducing chord templates for different chord types and roots, the chord present should be the one that is closest to the chroma vector according to a specific measure of fit.

Template Approaches

Template-based chord recognition methods are built on the premise that only the chord template is needed for recognition (Oudre et al., 2011a). A chord template is a twelve-dimensional vector representing the twelve semitones of the chromatic scale (Fujishima, 1999). The simplest chord templates have a binary structure, with values of 1 at chromas present in a chord definition, and 0 for other chromas. In common approaches, each chord is modelled by a binary Chord Type Template (CTT). Detection is then performed by first calculating scores for every root and chord type. These scores are computed from both the chroma vectors and hand-tuned variations of the original CTT (Oudre et al., 2011a), and the best score is then selected. Harte and Sandler (2005) further improved this method by applying a frequency tuning algorithm. They first define CTTs for only four chord types and then calculate the dot product between the chroma vectors and chord templates. Recognition is then conducted by applying low-pass filtering to the chromagram and median filtering to the detected chord sequence (Harte and Sandler, 2005). Lee (2006) also used a binary chord template, but carried out recognition on the EPCP by maximising the correlation between the chroma vectors and chord templates (Lee, 2006). The approaches described above all use derivatives of either a Nearest Neighbour method or a weighted sum method. The Nearest Neighbour method involves finding the Euclidean distance between chord templates and chroma vectors (Stark and Plumbley, 2009). The weighted sum method works by manually tuning each chord template to give it a different weight, so as to reflect the number of notes in the chord type and the probability of the chord type occurring. Negative weights can also be applied for better separation among similar chord types (Fujishima, 1999).
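A minimal sketch of binary-template matching in this style is shown below, assuming standard Major and Minor triad intervals and nearest-template classification by Euclidean distance (Python/NumPy; the label scheme and normalisation choices are illustrative, not taken from any of the cited systems):

```python
import numpy as np

# Binary chord-type templates over the 12 chroma bins (C, C#, ..., B).
# Intervals follow standard theory: major = root+4+7, minor = root+3+7.
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F',
              'F#', 'G', 'G#', 'A', 'A#', 'B']

def make_templates():
    """Build the 24 binary templates for Major and Minor triads."""
    templates = {}
    for root in range(12):
        for name, intervals in (('maj', (0, 4, 7)), ('min', (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates[f"{NOTE_NAMES[root]}{name}"] = t
    return templates

def classify_chord(chroma, templates):
    """Nearest-template classification by Euclidean distance,
    after normalising both chroma and template to unit length."""
    chroma = np.asarray(chroma, dtype=float)
    chroma = chroma / (np.linalg.norm(chroma) or 1.0)
    best, best_dist = None, np.inf
    for label, t in templates.items():
        dist = np.linalg.norm(chroma - t / np.linalg.norm(t))
        if dist < best_dist:
            best, best_dist = label, dist
    return best
```

A chroma vector with energy only at C, E and G matches the C Major template exactly, while one at A, C and E matches A Minor.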

Probabilistic Approaches

The most common probabilistic approach is the use of Hidden Markov Models (HMMs), which are most helpful for recognising sequences of chords. An HMM consists of a number of states, an initial state distribution, a state transition probability matrix which gives the probability of moving from one state to another, and observation probabilities which give the likelihood of an observation being generated from a particular state (Rabiner, 1989). In typical chord recognition systems every chord is represented by a hidden state and the chromagram frames are the observations. Chord recognition then involves finding the most likely sequence of hidden states that could have generated the observation sequence. The HMM parameters are either based on music theory, learned from real data, or a combination of the two (Oudre et al., 2011b). The first HMM used in chord recognition was proposed by Sheh and Ellis (2003). Their system comprises 147 hidden states, each representing a chord, covering seven chord types, including Major, Minor, Dominant Seventh, Augmented and Diminished, over twenty-one root notes. The HMM parameters are then trained with an EM algorithm (Sheh and Ellis, 2003). This model was improved by Bello and Pickens (2005), who proposed a complete rebuilding of the HMM, reducing the hidden states to twenty-four by considering the Major and Minor chords only. The HMM initialisation is then inspired by music theory, which naturally introduces musical knowledge into the model. The state transition and initial state probabilities remain the same; however, the observation probabilities are fixed, giving each chord a clear predetermined structure (Bello and Pickens, 2005). Others have experimented with twenty-four states but with different sets of input features (Ryynänen and Klapuri, 2008).

3. Technical Background

3.1. Fourier Transform

The Fourier Transform is a method for extracting the magnitude and phase spectrum from a sound signal. For the purposes of note recognition, only the magnitude spectrum

is considered, as the human ear is relatively insensitive to phase. The magnitude spectrum can be thought of as a decomposition of a signal into its frequency components. When applied to sampled sound signals it is known as the Discrete Fourier Transform (DFT), which is the equivalent of the continuous Fourier Transform for sampled signals: it acts on discrete samples of the signal as opposed to treating it as continuous (Harris, 1978). Equation 1 is the function for computing the DFT of a signal, where X(m) is the result, x(n) is the input signal, N is the length of the signal, n indexes the nth time-domain sample and m indexes the mth frequency bin:

X(m) = \sum_{n=0}^{N-1} x(n) \, e^{-j 2\pi n m / N}, \quad m = 0, 1, \ldots, N-1    (1)

The discrete nature of the DFT means that the result is only ever an approximation of the signal's frequency spectrum. The resulting values are complex numbers that contain real and imaginary parts. From these complex numbers the phase angle and the magnitude |X(m)| can be retrieved. The magnitude spectrum of the result is equivalent to the frequency spectrum of the signal, and it is the only part of the DFT result that will be used in this project.

Spectral Leakage

The DFT computation assumes that a signal is periodic in N, the length of the signal being analysed. When the DFT of a non-periodic signal is computed, the resulting frequency spectrum suffers from leakage, in which the signal energy is smeared out over a wide frequency range. The dispersed nature of the resulting DFT makes it harder to determine the frequency content of the signal. The most effective way of tackling spectral leakage is to apply a windowing function to the signal. By default, all discrete signals have a rectangular window applied to them, which multiplies every point by 1. Windowing functions, such as the Hamming window, taper the amplitudes at the start and end of the signal to a smaller value than at the peak in the centre.
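The effect of a Hamming window on leakage can be illustrated numerically. The sketch below compares the out-of-band energy of a non-periodic tone with and without the window; the tone frequency, guard band and leakage measure are illustrative choices, not values from this project.

```python
import numpy as np

def leakage_energy(signal, peak_bin, guard=5):
    """Fraction of spectral energy lying outside a small band around the
    expected peak bin -- a crude, illustrative measure of leakage."""
    mag2 = np.abs(np.fft.rfft(signal)) ** 2
    band = mag2[max(0, peak_bin - guard):peak_bin + guard + 1].sum()
    return 1.0 - band / mag2.sum()

fs, n = 1000, 1000
t = np.arange(n) / fs
# A 10.5 Hz tone is not periodic in a 1 s window, so the default
# rectangular window smears its energy across many bins.
tone = np.sin(2 * np.pi * 10.5 * t)
rect_leak = leakage_energy(tone, 10)
# The Hamming window's lower side lobes keep the energy concentrated.
hamming_leak = leakage_energy(tone * np.hamming(n), 10)
```

For this tone, the windowed version leaves markedly less energy outside the main lobe than the rectangular version.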
The DFT of a window has a peak at the applied frequency, and smaller side lobe peaks on either side. The height of these side lobes indicates the effect that the windowing

function will have on frequencies around the applied frequency. Generally, the lower the side lobes, the more the window will reduce leakage in the DFT.

Frequency Resolution

The DFT returns a discrete spectrum, as opposed to a continuous one, so the frequency content of the signal is resolved into a finite number of bins. The resolution of an N-point DFT is the frequency spacing between each of these bins, calculated using equation 2:

f_{resolution} = \frac{f_s}{N}    (2)

where f_s is the sample rate and N is the number of bins. For example, a DFT of a signal sampled at 16 kHz can have a frequency resolution of 0.5 Hz, in which case the frequency spacing between each DFT bin is 0.5 Hz. The value of N varies depending on the length of the signal the DFT is taken over, and the sampling rate.

Fast Fourier Transform

Although the DFT has a simple implementation and produces correct results, it is an inefficient algorithm and therefore not suited to the real-time needs of this project. The first efficient implementation of the DFT, called the Fast Fourier Transform (FFT), was introduced by Cooley and Tukey (1965). It produces the same output as the DFT, but with many of the redundant calculations removed: the direct DFT often performs the same calculations several times, which the FFT avoids. In MATLAB, the fft function is based on the Cooley-Tukey algorithm. The execution time of the FFT depends on the length of the transform, and is fastest for transforms whose lengths are powers of 2. The FFT is slower for lengths that are prime or have large prime factors, although it remains fast when lengths have only small prime factors.

3.2. Distance Measures

Distance measures, sometimes referred to as metrics, are functions that define the distance between two vectors. The distance metrics used throughout this project are defined below.

Euclidean

The Euclidean distance between two vectors is the length of the line segment connecting them in N-dimensional space. The distance between vectors p and q is given by equation 3:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (3)

where p_i and q_i are the values of p and q in the ith dimension.

Hamming

The Hamming distance between two vectors of equal length is the number of positions at which the corresponding symbols differ. It measures the minimum number of substitutions that could have transformed one vector into the other, as shown in equation 4:

D_H(x, y) = \sum_{i=1}^{m} D_i, \quad D_i = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \neq y_i \end{cases}    (4)

Cosine

The Cosine distance is a measure of the angle between two vectors. If both vectors are equal then the angle between them is 0, and since cos(0) = 1, the cosine similarity of identical vectors is 1; the subtraction from 1 in equation 5 therefore gives identical vectors a distance of 0:

d(p, q) = 1 - \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2} \sqrt{\sum_{i=1}^{n} q_i^2}}    (5)

3.3. Naive Bayes

Bayesian methods of classification are based on probability theory and play a critical role in probabilistic learning and classification. Bayesian models build a generative

model that approximates how the data is produced. This is done using the prior probability of each class given no information about an item. Categorisation produces a posterior probability distribution over the possible classes given a description of an item (Manning et al., 2008). The probability of chord d being in class c is computed using equation 6:

P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(f_k \mid c)    (6)

where P(f_k | c) is the conditional probability of note f_k occurring in a chord of class c. We interpret P(f_k | c) as a measure of how much evidence there is that c is the correct class, and P(c) is the prior probability of a chord occurring in class c.

Multinomial Model

Multinomial Naive Bayes only considers notes that are in the query chord, so only the presence of a note is taken into account. The goal is to find the best class for the chord. The best class in multinomial classification is the most likely, given by equation 7:

c_{map} = \arg\max_c P(c \mid d) = \arg\max_c P(c) \prod_{1 \le k \le n_d} P(f_k \mid c)    (7)

In equation 7, the conditional probabilities are multiplied together. As the values of the probabilities are often very small, many multiplications can result in floating point underflow. It is therefore more common to perform the calculation by adding the logarithms of the probabilities instead of multiplying them; the class with the highest log probability score is still the most probable. Hence, equation 7 can be rewritten as:

c_{map} = \arg\max_c \left[ \log P(c) + \sum_{1 \le k \le n_d} \log P(f_k \mid c) \right]    (8)

In equation 8, log P(f_k | c) is a weight that indicates how good an indicator f_k is for class c. Its sum with the log prior, log P(c), is then a measure of how much evidence there is for the chord being in the class, and equation 8 selects the class for which there is the most evidence. The parameter P(c) is found using the maximum likelihood estimate, which for prior

probabilities is found using equation 9:

P(c) = \frac{N_c}{N}    (9)

where N_c is the number of chords in class c and N is the total number of chords. The maximum likelihood estimate for each note, P(f | c), is the relative frequency of note f in chords belonging to class c, as shown by equation 10:

P(f \mid c) = \frac{F_{cf}}{\sum_{f' \in V} F_{cf'}}    (10)

where F_{cf} is the number of occurrences of f in the training chords from class c, including multiple occurrences.

Multivariate Bernoulli Model

An alternative to the multinomial model is the multivariate Bernoulli model. The Bernoulli model estimates P(f_k | c) as the fraction of chords of class c that contain note f_k. Bernoulli uses all the notes in the vocabulary, and so takes into account the absence of a note from the query as well as its presence. The parameter P(f | c) is calculated using equation 11:

P(f \mid c) = \frac{\sum_{k=1}^{D} \delta(f, d_k) + 1}{D + 2}    (11)

where D is the number of chords in class c, d_k is the kth such chord, and:

\delta(f, d) = \begin{cases} 1 & \text{if note } f \text{ occurs in chord } d \\ 0 & \text{otherwise} \end{cases}    (12)

3.4. Nearest Neighbour

Nearest Neighbour is a very simple but powerful classification technique, based on the premise that vectors which are close to each other in a vector space belong to the same class. The simplest form of Nearest Neighbour is 1NN classification, where each chord is assigned the class of its nearest neighbour. The more common form is kNN classification, a more powerful technique than 1NN. For kNN, we assign each chord to the majority

class of its k closest neighbours, where k is a parameter. kNN is more robust than 1NN as it does not rely on single examples in the training data. Nearest Neighbour has some advantages over other classification techniques. Firstly, it does not require any feature selection, which is often necessary for Naive Bayes classification. Secondly, it scales well to large numbers of classes, as there is no need to train n classifiers for n classes. It is also possible to run a 1NN classifier without any training, although some sort of library of labelled examples is still needed.

3.5. Musical Terms

Throughout this project, terms relating to the theory of western music will be used. So that they can be fully understood and referred to, they are defined below:

Chromatic Scale: A musical scale with twelve pitches, each a semitone above or below its neighbours.

Chord: Any harmonic set of three or more notes that is heard as if sounding simultaneously. The most common types of chords are called triads, and contain three distinct notes. Further notes may be added to triads to produce seventh or ninth chords.

Power Chords: A type of chord specific to electric guitar music; power chords are dyads that contain a root note and the fifth.

Note: A single tone from the chromatic scale, without any regard to the octave, for example C or C♯.

Pitch: A note written with its octave information included; for example, A4 is an A note in the fourth octave.
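The multivariate Bernoulli classifier of Section 3.3, with the add-one smoothing of equation 11 and log-space scoring in the spirit of equation 8, might be sketched as follows (Python/NumPy; the toy training data and class labels are hypothetical):

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """X: (num_chords, 12) binary note-presence matrix; y: class labels.
    Returns priors P(c) (equation 9) and per-note probabilities P(f|c)
    with the add-one smoothing of equation 11."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    priors, note_probs = {}, {}
    for c in sorted(set(y)):
        Xc = X[y == c]
        D = len(Xc)
        priors[c] = D / len(X)                          # equation 9
        note_probs[c] = (Xc.sum(axis=0) + 1) / (D + 2)  # equation 11
    return priors, note_probs

def classify_bernoulli_nb(x, priors, note_probs):
    """Score in log space (equation 8 style), using both the presence
    and the absence of each of the twelve notes."""
    x = np.asarray(x, dtype=float)
    best, best_score = None, -np.inf
    for c, p in note_probs.items():
        score = np.log(priors[c])
        score += np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        if score > best_score:
            best, best_score = c, score
    return best
```

Trained on a few binary chroma vectors per class, the classifier assigns a query vector to the class whose note-presence profile it best matches.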

4. Development of the System

4.1. Data Collection

In order to properly evaluate the effectiveness of the methods investigated, a large library of notes and chords had to be recorded. Chords played on the electric guitar were recorded at Hz through a USB interface using the software package Logic Pro X. This sample rate was chosen as it is good practice to keep a library at a high sample rate: recordings can be resampled to lower rates when required, but cannot be upsampled effectively from a lower sample rate. Chords played on an acoustic guitar were recorded through a microphone at Hz using the software package Audacity; this rate was chosen as it was the highest sample rate the microphone could record at. Table 1 shows the chords recorded on the electric guitar and Table 2 shows the chords recorded on the acoustic.

Table 1: Chords recorded on the electric guitar

  Chord type     Range           Sets
  Single Notes   E2 - E5         5
  Fifths         E5(2) - E5(3)   25
  Major          C - B           25
  Minor          C - B           25
  Major 7th      C - B           25
  Major 9th      C - B           25

Table 2: Chords recorded on the acoustic guitar

  Chord type     Range           Sets
  Major          C - B           15
  Minor          C - B           15

Before classification, each of the recordings was down-sampled to Hz. This figure was chosen in order to remove as much of the unnecessary high-frequency information as possible. As the highest frequency representable at this sample rate is 5000 Hz, the chosen rate sufficiently covers the frequency range produced by a guitar.

4.2. Segmenting

In order to speed up the process of separating the recordings into their individual chords, an automatic segmenter function was written. It takes in a recording and produces an array containing the start and end points of each chord in the recording. It works by splitting the recording into a set of small frames and taking the energy of each frame: frames with a high energy content are chords, and frames with low energy are gaps between chords. Although not formally tested, the segmentation process appeared to work reliably, albeit after occasional changes to the frame size, as all sets of segments produced were of the correct size. Faulty segments would have been obvious, as the frequency spectra produced would have been unusual.

4.3. Obtaining the Frequency Spectrum

The method used for obtaining the frequency spectrum was identical throughout this project and can therefore be explained at this point. The method was applied to a segment of a recording. Whilst still in the time domain, a Hamming window was applied to the signal in order to reduce spectral leakage. The resulting signal was then passed into the MATLAB fft function. Since the output is an array of complex numbers, the magnitude of each value was taken. The frequency resolution was calculated as the sample rate divided by the length of the input signal. This ensured that the frequency resolution was fine enough to avoid the frequencies of neighbouring notes overlapping: the smallest gap between notes of interest was 7 Hz, which is the gap between an E2 and an F2. As the smallest input length was likely to be around two seconds, the frequency resolution was 0.5 Hz, which was sufficient. The frequency spectrum produced could then be analysed by the peak picker.
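The energy-based segmentation and windowed-spectrum steps described above might look like the following sketch (Python/NumPy rather than the project's MATLAB; the frame length and energy threshold are illustrative values, not the project's):

```python
import numpy as np

def segment_by_energy(recording, frame_len=1024, threshold=0.01):
    """Label frames as chord/silence by short-time energy, then return
    (start, end) sample indices of each high-energy run.
    frame_len and threshold are illustrative, not the project's values."""
    n_frames = len(recording) // frame_len
    frames = recording[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    active = energy > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i * frame_len          # run of chord frames begins
        elif not a and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:                  # recording ends mid-chord
        segments.append((start, n_frames * frame_len))
    return segments

def magnitude_spectrum(segment, sample_rate):
    """Hamming-windowed magnitude spectrum; bin spacing = fs / len."""
    windowed = segment * np.hamming(len(segment))
    mag = np.abs(np.fft.rfft(windowed))
    resolution = sample_rate / len(segment)
    return mag, resolution
```

Applied to a silence-chord-silence recording, the segmenter returns one (start, end) pair, and the spectrum of that segment peaks near the played frequency.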

4.4. Peak Picking

The performance of the peak picking algorithm is vital, as it underpins the recognition methods used throughout this project. The purpose of the peak picker is to extract the frequencies that are dominant in the frequency spectrum. High-energy frequencies are identifiable as very high peaks in an otherwise flat spectrum, and a peak picker needs to correctly identify these peaks and their exact locations. After initial research into peak picking, it was decided that the built-in MATLAB function findpeaks would be acceptable. The findpeaks function provides a comprehensive peak finding algorithm with several configuration parameters. The parameters relevant to this project are:

MinPeakHeight - The minimum height of a peak.

Threshold - The minimum height difference between a peak and its neighbours.

MinPeakDistance - The minimum peak separation.

NPeaks - The maximum number of peaks to return.

In order to properly test the effectiveness of findpeaks under different configurations, two sets of single notes had their peaks manually labelled. For each note, the FFT was taken and the number of peaks, as well as their locations, was recorded. For each configuration, the precision, recall and F-measure were taken across a range of values. Precision was calculated as the proportion of correctly labelled peaks out of the total number of peaks detected. Recall is the proportion of peaks detected out of the total number of expected peaks. The F-measure is the weighted harmonic mean of precision and recall, calculated using equation 13:

F = \frac{2 \times precision \times recall}{precision + recall}    (13)

The first configuration tested was the threshold. The input value for threshold was tested from 1 to 50. Figure 1 shows the change in precision, recall and F-measure across this range. In terms of precision, the score starts low, but increases very quickly as the

22 threshold is increased. The score continues to rise until threshold reaches 22, where the precision starts to fall again. However, although the also recall rises very quickly, it s score begins to decrease after threshold reaches 8, and this is the same for the F-measure. As a result, 8 was chosen to be the optimum value for threshold. Figure 1: Effect of changing threshold The second configuration tested was the minimum distance. Again, this was carried out on an input value range of 1 to 50. However, the threshold was set to the optimum from the previous experiment of value 8. Figure 2 shows the change in precision, recall and F-measure. The scores for all three measures starts very low, rises quickly for low input values, and levels off after 14. Although a small increase was seen in precision and F-measure after 40, this was not chosen as the optimum as the increase was too small to be useful. Therefore the optimum was chosen to be 14. Figure 2: Effect of changing minimum distance The final configuration tested was the minimum height. This test was carried out again with threshold at 8, and now minimum distance at 14 across the range of 1 to Reg:

50. Figure 3 shows the change in precision, recall and F-measure. With the threshold and minimum distance parameters at their optimum values, changing the minimum height had little effect on accuracy scores; however, when the input value was 32, there was a small increase in precision and F-measure, whilst the recall stayed the same. Therefore the optimum was chosen to be 32.

Figure 3: Effect of changing minimum height

With the optimum configurations now set for findpeaks, the peaks could be successfully extracted from the frequency spectrum. Figure 4 shows the frequency spectrum of a C Major chord with the peaks clearly marked. The findpeaks function returns two arrays, containing the locations of the peaks and their amplitudes. The peak locations were scaled to the frequency resolution of the frequency spectrum before being used for classification.

Figure 4: Frequency Spectrum of C Major with Peaks
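The tuning experiments above can be reproduced in outline with SciPy, whose find_peaks exposes parameters analogous to findpeaks (height ≈ MinPeakHeight, threshold ≈ Threshold, distance ≈ MinPeakDistance). A sketch, using the optimum values found above as defaults:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_peaks(spectrum, min_height=32, threshold=8, min_distance=14):
    """Return peak locations (bin indices) and amplitudes."""
    locs, props = find_peaks(spectrum, height=min_height,
                             threshold=threshold, distance=min_distance)
    return list(locs), list(props["peak_heights"])

def precision_recall_f(detected, expected):
    """Precision, recall and F-measure (equation 13) for a set of
    detected peak locations against manually labelled ones."""
    correct = len(set(detected) & set(expected))
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(expected) if expected else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Sweeping one parameter while holding the others fixed, as in Figures 1-3, then reduces to calling precision_recall_f for each setting.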

4.2. Single Note Recognition

The first classification task approached was simple single note recognition. This was chosen in order to test the effectiveness of the peak picking algorithm and the application of the fft, as these techniques were the basis for the remainder of the project. A very simple approach would be to take the fft and label the note based on the value of the strongest peak, assuming this to be the fundamental harmonic. However, this is prone to error, as the peak with the highest amplitude is not guaranteed to be the fundamental, and therefore the note can be recognised incorrectly. In order to classify notes correctly, a method was developed that took into account all the harmonics present in the frequency domain, called the mean distance method.

The mean distance method classifies notes by the similarity between the mean distance of the detected harmonics and the fundamental frequencies of the possible notes. Every harmonic after the fundamental is a multiple of the fundamental frequency, so the distance between consecutive harmonics in the frequency spectrum is equal to the fundamental frequency itself. In order to classify the notes, a note library containing the names of the notes ranging from E2 - E5, as well as their fundamental frequencies, was used. The first step is to obtain the frequency spectrum from the recorded note and obtain the detected harmonics using findpeaks. The peaks returned are inserted into an npeaks × 1 array in the order they were detected, and the absolute distance between each pair of adjacent peaks is then taken. The mean of these distances is subsequently compared to the fundamentals of the notes in the note library, and the recording is classified as the closest note. Table 3 shows an E2 note with all harmonics detected correctly. Despite some slight variation in the distances between peaks, the mean distance is still clearly an E2 (82 Hz).

Table 3: Mean distances of an E2 note (columns: Frequencies, Distances; Mean Distance: 82.6)

A requirement of the mean distance method is that all the harmonics have to be identified correctly. However, in practice, it is common for the peak picker to either wrongly detect two peaks that are very close together, or miss a harmonic entirely. If these erroneous peaks were included in the calculation of the mean distance, then they would alter an otherwise well defined note.

To remove the erroneous peaks, the list of detected peak differences was sorted in ascending order. If a difference was below 77 Hz then it was removed, as this is below the acceptable threshold for the distances between the harmonics of an E2, which is the lowest note on a guitar. The differences between the differences were then taken. If this value was greater than 7 Hz then the second difference was discounted, and if it was less than 7 Hz, then the first difference was removed. 7 Hz was chosen as this is the smallest difference between semitones at the lowest frequency of interest, and a difference greater than this implies no relation between harmonics. Table 4 shows an E2 note with a missing peak, the harmonic at 245 Hz. As a result, the mean distance is skewed upwards, so the note would be incorrectly labelled. Taking the distances between the distances shows that distance 2 is too different from the other differences, so it can be discounted. The new mean distance is then calculated as 82.6, which is correct.
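The mean distance method above can be sketched in Python (the project itself was implemented in MATLAB). The note library here is a small illustrative subset, and the difference filter is a simplified variant of the rule described in the text:

```python
import numpy as np

# Illustrative subset of the note library (name, fundamental frequency in Hz).
NOTE_LIBRARY = [("E2", 82.41), ("A2", 110.00), ("D3", 146.83),
                ("G3", 196.00), ("B3", 246.94), ("E4", 329.63)]

def clean_differences(diffs, floor=77.0, tol=7.0):
    """Simplified variant of the erroneous-difference filter: drop
    differences below 77 Hz, then drop any difference more than 7 Hz
    away from the previously retained one (sorted ascending)."""
    diffs = sorted(d for d in diffs if d >= floor)
    kept = []
    for d in diffs:
        if not kept or abs(d - kept[-1]) <= tol:
            kept.append(d)
    return kept

def classify_note(peak_freqs):
    """Mean distance method: the mean spacing between detected harmonics
    approximates the fundamental, which is matched to the note library."""
    diffs = np.abs(np.diff(np.sort(peak_freqs)))
    kept = clean_differences(diffs)
    mean_d = float(np.mean(kept)) if kept else float(min(peak_freqs))
    return min(NOTE_LIBRARY, key=lambda nf: abs(nf[1] - mean_d))[0]
```

For example, an E2 with the third harmonic missed (peaks at 82.4, 164.8, 329.6 and 412.0 Hz) is still classified as E2, because the doubled 164.8 Hz spacing is filtered out.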

Table 4: Mean distances of an E2 note (columns: Peak Frequencies (Hz); Distances [between the frequencies]; 2nd Distances [between the distances]; Final Distances; Mean Distance)

4.3. Two Note Chords

Classification by the mean distance method works well for single notes. However, when applied to two note chords it fails, because the frequency spectrum now contains two sets of harmonics. The distances between peaks are therefore no longer of any use, so a new method of classification had to be developed.

Revised Peak Picker

After initial research into the characteristics of the frequency spectrum produced by two note chords, it became clear that the findpeaks function was no longer suitable. Despite its large number of configuration parameters, the algorithm itself was simply not designed for the type of peaks produced by the fft. The gradient of these peaks increases so quickly that findpeaks is unable to recognise that there is an up-slope. Therefore a new algorithm was developed.

The revised algorithm worked by first zeroing all points in the spectrum that were below a certain threshold. This removed all the redundant information at very low amplitudes and also created well defined segments, which held the harmonic peaks. Within each of these segments, the maximum points were found using the second differential, and the highest maximum point was taken to be the peak.

Similar tests to the ones performed on findpeaks were carried out on this new algorithm. However, the only parameter that could be changed was the threshold below which values in the frequency spectrum were zeroed. Figure 5 shows the effect on the precision, recall and F-measure. The experiment showed the optimum input value to be 25: although the precision reaches higher values beyond this point, the recall and F-measure both start to fall after 25. This optimum configuration was used for the remainder of the project.

Figure 5: Effect of changing threshold

Feature Extraction

For feature extraction, a new algorithm was developed in order to extract the notes present in the frequency spectrum. The algorithm used a library of 48 10-dimensional vectors that represented the frequencies of each harmonic of a note. The first step of the algorithm involved collecting all the notes present in the frequency spectrum. Every peak was compared with the fundamental frequencies from the note library and assigned a number between 1 and 12 (1 being C and 12 being B). Each note was assigned a number simply to make later comparisons easier. The number of the harmonic was also recorded. These values were put into an npeaks × 2 array in the order they appeared in the frequency spectrum.

The second step was to normalise the harmonic number assigned to each note. The assumption being made here was that the first few notes found were the fundamentals of the remaining notes. Therefore, in order to retain the information regarding a note's harmonic, each remaining note's harmonic number was adjusted to match the harmonic of the first note found of its type.

The final step was to count the occurrences of each note. The feature vector was passed over and any note found was put into a new feature vector along with a count of its occurrences. If a note was already present in the new feature vector then the count value was simply incremented. The result of this stage was an nnotes × 3 feature vector.

As an example, Table 5 shows the process of extracting an E5 chord. An E5 contains an E2 and a B2, which both appear in the chord, along with an erroneous F3. However, this erroneous peak does not affect the labelling of the chord, as there is only 1 occurrence of it, compared to 4 and 3 occurrences of E2 and B2 respectively.

Table 5: Feature extraction process of an E5 (columns: Frequencies; Notes and Harmonics [Note, Harmonic]; After Normalising [Note, Harmonic]; Note Count [Note, Harmonic, Occurrence]; Final Chord)
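The collection and counting steps can be sketched in Python (the project's implementation was in MATLAB; the note library below is a small illustrative subset of the 48-note library, and the harmonic normalisation step is omitted for brevity):

```python
from collections import Counter

# Illustrative subset of the note library: fundamentals in Hz.
NOTE_LIBRARY = {"E2": 82.41, "A2": 110.00, "B2": 123.47,
                "D3": 146.83, "G3": 196.00}

def match_peak(freq, max_harmonic=10):
    """Label a peak with the (note, harmonic) whose predicted frequency
    k * f0 lies closest to it."""
    return min(((name, k) for name in NOTE_LIBRARY
                for k in range(1, max_harmonic + 1)),
               key=lambda nk: abs(freq - nk[1] * NOTE_LIBRARY[nk[0]]))

def two_note_chord(peak_freqs):
    """Count how often each note's harmonics appear, and keep the two
    most common notes as the chord (sorted here for determinism)."""
    counts = Counter(match_peak(f)[0] for f in peak_freqs)
    return sorted(name for name, _ in counts.most_common(2))
```

As in the text, an erroneous extra peak is harmless as long as the correct notes occur more often than it does.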

Classification

From the resulting feature vector, only the two most common notes were put forward for classification. This was based on the assumption that any erroneous note detected would not appear more often than any correct note. The resulting 2 × 2 feature vector was compared to its correct label. When the classification was carried out, a matrix containing the correct pairs of notes, in the order they appeared in the recordings, was used to evaluate the accuracy of the method.

Pitch Class Profiles

The method described for two note recognition is perfectly adequate for two note chords, but cannot realistically be applied to larger chords. This is because it requires prior knowledge of the size of the chord being recognised, and therefore cannot be applied to a real time application where this knowledge is not available. There is also a problem with simply selecting the most commonly occurring notes, because there is a greater possibility of interference from erroneous notes as chords get larger. As a result, a more versatile approach had to be taken.

Pitch class profiles were chosen for three note recognition and beyond, as they require no knowledge of the chord size before recognition and so are more suitable for a real time application. Pitch class profiles are 12 dimensional binary vectors, where each position represents one semitone on the chromatic scale. In the vector, a 1 represents the presence of a note, and 0 the absence of a note. For example, Table 6 contains the pitch class profiles for the Major chords.

Feature Extraction

The purpose of feature extraction was to translate the detected peaks into the chord's pitch class profile. The process was very similar to the two note feature extraction process. For each peak detected, its frequency was compared with the fundamentals in the note library and the closest match was taken to be the note of the peak. However, for pitch class profiles, when a note was detected the corresponding position in the chord's pitch class profile was changed to 1. If the position was already set to 1 then nothing
was changed, as pitch class profiles do not contain information about the occurrence of a note.

Table 6: Pitch class profiles of the Major chords

          C  C# D  D# E  F  F# G  G# A  A# B
C Major   1  0  0  0  1  0  0  1  0  0  0  0
C# Major  0  1  0  0  0  1  0  0  1  0  0  0
D Major   0  0  1  0  0  0  1  0  0  1  0  0
D# Major  0  0  0  1  0  0  0  1  0  0  1  0
E Major   0  0  0  0  1  0  0  0  1  0  0  1
F Major   1  0  0  0  0  1  0  0  0  1  0  0
F# Major  0  1  0  0  0  0  1  0  0  0  1  0
G Major   0  0  1  0  0  0  0  1  0  0  0  1
G# Major  1  0  0  1  0  0  0  0  1  0  0  0
A Major   0  1  0  0  1  0  0  0  0  1  0  0
A# Major  0  0  1  0  0  1  0  0  0  0  1  0
B Major   0  0  0  1  0  0  1  0  0  0  0  1

To evaluate the performance of pitch class profiles, features were extracted from the entire chord library, as shown in tables 1 and 2, except for the single notes. For classification, a template approach using Nearest Neighbour, as well as a probabilistic approach using Naive Bayes, were investigated.

Template Classification

The first pitch class profile classification technique investigated was a template based approach, using Nearest Neighbour. The rationale behind using Nearest Neighbour was to provide a non-learnt method of classification, in which any chord played can be recognised without the need for training data.

In order to apply Nearest Neighbour, a set of training examples is normally needed. However, for pitch class profiles all that is needed for recognition is a library containing the correct representation of each chord. Usually, 1NN is not as effective as kNN, as 1NN does not
protect against errors in the training data. However, as the chord library contains perfect information about each chord, 1NN will still be successful. The chord library used for testing contained the pitch class profiles for 23 different chord types, for all 12 notes on the chromatic scale. This was represented as a 276 × 12 matrix.

One of the key components in Nearest Neighbour classification is the similarity measure used, and three different similarity measures were used in this project: cosine distance, Hamming distance and Euclidean distance. Cosine distance was used as it is generally considered to be the most effective distance measure. Hamming distance was chosen because the pitch class profiles are binary vectors, so Hamming distance should produce very similar results to the cosine distance. Finally, Euclidean distance was used as it is the similarity metric used by Oudre et al. (2011b), who also used pitch class profiles for classification.

To test the effectiveness of Nearest Neighbour, the technique was tested on every chord example in the chord library, excluding single notes. The classification took the closest vector in the training set to the test vector to be the correct chord; therefore, the distance between each test pitch class profile and every training vector was taken. Tables 7, 8 and 9 show the top 5 cosine, Hamming and Euclidean distances between a C Major chord and the chords in the chord library.

Table 7: Cosine distances between a C Major and chords in the chord library (top 5, in order: C Major, C Major7, C5, C Minor, E Minor)

Table 8: Hamming distances between a C Major and chords in the chord library (top 5, in order: C Major, C5, C Major7, C Minor, E Minor)
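A minimal Python sketch of the template approach, assuming ideal binary PCP templates (the library below is a tiny illustrative subset of the full chord library; the project itself was implemented in MATLAB):

```python
import numpy as np

def pcp_from_pitch_classes(pcs):
    """12-dimensional binary pitch class profile (index 0 = C ... 11 = B)."""
    v = np.zeros(12)
    v[list(pcs)] = 1.0
    return v

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming_dist(a, b):
    return float(np.sum(a != b))

def euclidean_dist(a, b):
    return float(np.linalg.norm(a - b))

def nearest_neighbour(test_pcp, library, dist=cosine_dist):
    """1NN against a library of ideal chord templates: {name: pcp}."""
    return min(library, key=lambda name: dist(test_pcp, library[name]))
```

With an ideal C Major template in the library, the PCP containing C, E and G is returned as C Major under all three measures.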

Table 9: Euclidean distances between a C Major and chords in the chord library (top 5, in order: C Major, C5, C Major7, C Minor, E Minor)

The final accuracy of Nearest Neighbour was calculated as the number of correctly identified chords divided by the total number of chords classified. The classification function produced an nchords × nexamples matrix, where nchords was the number of different chord types tested and nexamples was the number of examples of each. Every position in the matrix was a number representing the name of the chord recognised. This matrix was compared to a matrix of the same size containing the correct results, and a comparison between these matrices produced a confusion matrix.

Probabilistic Classification

The second pitch class profile classification technique investigated was a probabilistic approach using Naive Bayes. Where Nearest Neighbour is a non-learnt classification technique, Naive Bayes is a learnt approach, as training data is required for classification. The two types of Naive Bayes models used for classification were a Multinomial and a Multivariate Bernoulli model.

In order to train the models, each set of chords was split into 15 training examples and 10 testing examples. This was done because it is very easy to obtain erroneously high accuracy from a model that has been trained and tested on the same data. In order to achieve a better idea of real world performance, the data had to be split.

For the Multinomial model, the conditional probabilities were calculated using equation 10. For example, Table 10 shows the calculated conditional probabilities for a C Major. A C Major contains the notes C, E and G, so their corresponding conditional probabilities are much larger than those of the notes that do not appear. The conditional probability of D is also slightly more highly weighted than the others, which could be due to one of the chords in the training set containing an erroneous occurrence of D.

Table 10: Conditional probabilities of a C Major using a Multinomial model (columns: C, C#, D, D#, E, F, F#, G, G#, A, A#, B)

For the Bernoulli model, the conditional probabilities were calculated using equation 11. For example, Table 11 shows the calculated conditional probabilities for a C Major. The spread of weights is very similar to the Multinomial model, with occurring notes being weighted much higher than non-occurring ones. As the PCP is a binary vector, the count of occurrences of a note and the count of chords containing that note will be identical, so the only difference in performance between Multinomial and Bernoulli will be observed at test time.

Table 11: Conditional probabilities of a C Major using a Bernoulli model (columns: C, C#, D, D#, E, F, F#, G, G#, A, A#, B)

In order to classify a chord using Naive Bayes, the class with the most evidence for being the chord has to be found. For the Multinomial model, equation 8 is used. The logarithmic prior probabilities of all the classes are set equal, using equation 9. Then, for each occurring note in the chord, the logarithmic conditional probability is added to the prior. For example, Table 13 shows the 5 classes with the highest evidence for a C Major chord.

For the Bernoulli model, the prior probabilities of all the classes are again set equal using equation 9. Then, for each occurring note in the chord, the probability is added to the prior. However, unlike the Multinomial model, for every absent note, 1 − P(f|c) is also added to the prior. For example, Table 12 shows the 5 classes with the highest evidence for a C Major chord.
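The two models can be sketched as follows (a Python illustration, not the project's MATLAB code; equal priors are assumed, as in the text, and add-one Laplace smoothing stands in for the smoothing implied by equations 10 and 11):

```python
import numpy as np

def train_multinomial(train_pcps):
    """train_pcps: {chord_name: (n_examples, 12) array of binary PCPs}.
    Note counts normalised over the 12 pitch classes, with add-one smoothing."""
    probs = {}
    for name, X in train_pcps.items():
        counts = X.sum(axis=0)
        probs[name] = (counts + 1.0) / (counts.sum() + 12.0)
    return probs

def train_bernoulli(train_pcps):
    """Per-note probability that the note is present in a chord example."""
    probs = {}
    for name, X in train_pcps.items():
        probs[name] = (X.sum(axis=0) + 1.0) / (X.shape[0] + 2.0)
    return probs

def classify_multinomial(pcp, probs):
    # Sum the log conditional probabilities of the notes that occur (equal priors).
    return max(probs, key=lambda name: np.sum(pcp * np.log(probs[name])))

def classify_bernoulli(pcp, probs):
    # Present notes contribute log P(f|c); absent notes contribute log(1 - P(f|c)).
    def evidence(p):
        return np.sum(pcp * np.log(p) + (1 - pcp) * np.log(1 - p))
    return max(probs, key=lambda name: evidence(probs[name]))
```

Note that the Bernoulli model also penalises the notes that are absent, mirroring the 1 − P(f|c) term described above.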

Table 12: Classes with the most evidence for a C Major using the Multivariate model (top 5, in order: C Major, C5, C Major7, E Minor, A Minor)

Table 13: Classes with the most evidence for a C Major using the Multinomial model (top 5, in order: C Major, C Major7, C5, A5, C Minor)

5. Results

5.1. Single Note Recognition

Classification using the mean distance method for single notes produced an accuracy of 77%. This was a disappointing result, because for the relatively simple task of picking one frequency value from the spectrum, a higher accuracy would be expected. However, on inspection of the confusion matrix, shown in appendix A, many of the notes were recognised correctly 4 out of 5 times, which suggests that one poor set of recordings skewed the final accuracy. There were also many instances of notes being incorrectly recognised as E2s, which suggests possible interference from the E string. This interference could have originated from the fact that many of the notes were played on the A string whilst the E string was being muted: if the E string was not muted properly, an E2 would appear in the frequency spectrum.

The mean distance method could have been improved by refining the method for discounting erroneous peaks. In its current form, a few bad peaks at the start of the note can throw the entire classification, as the distances used for comparison would be wrong. One possible solution could be to use the lowest common multiple of the peaks when discounting erroneous ones. An entirely different and more sophisticated approach could be to use K-means clustering for recognition, where each peak would be matched to a cluster, and the most frequently occurring cluster would be the note.

5.2. Two Note Recognition

The tests for classifying two note chords produced an accuracy of 90%. This was a reasonable result; however, there is still room for improvement. The improvement in performance over single note recognition could be due to the superior method of removing erroneous notes, as the position of the error did not affect the final classification. The improved peak picker also helped to boost performance, as fewer erroneous peaks were detected.

One observation from the confusion matrix, shown in appendix A, was that no chords were misclassified as other chords: every chord was either recognised correctly or not recognised at all. This suggests that any failed chords either contained only one detected note, or one of the notes was a semitone out, so that no chord existed for it to be classified as. There were three chords that consistently failed to be recognised: A3 E4, A 3 D 3 and G 2 D3. This could simply be related to poor playing during the recordings.

This method could be improved with a more sophisticated approach to picking the two notes in the chord. Currently, the two most frequently occurring notes are chosen, but this is susceptible to interference from erroneous notes, especially when few notes are recognised. In addition, in the situation where three or more notes all occur equally often, the decision as to which two notes are selected is left to the MATLAB sort function. One solution would be some form of pattern matching algorithm with prior knowledge of the relationships between the notes in chords; the note selection would then be made based on the relationships between the notes detected.

Pitch Class Profiles

Nearest Neighbour

Using Nearest Neighbour, an accuracy of 93.3% was achieved with cosine similarity, and 94% using both Hamming and Euclidean distance, as shown in Table 14. This was a good result, as it exceeded the 90% target. The confusion matrix produced is in appendix B.

It is interesting to note that the chords that were most commonly incorrectly recognised generally all contained an E. This may be due to the low E string producing lower
energy than other strings, especially higher up the fretboard, and therefore the frequency data for an E may have been consistently missed. This may help to explain why the chords that failed consistently during two note recognition also included the low E string.

There is also a small group of 7 chords, ranging from D Major to A Major, that were all incorrectly recognised on several occasions. These chords are all at the extreme ends of the fretboard, with D Major being played at the 10th fret and A Major on the 5th. At the higher end of the fretboard, chords were incorrectly recognised as fifths, which suggests that in some of the recordings the 3rd string was not being fretted properly, due to the small fret size. At the lower end of the fretboard, the chords were labelled as chords outside the scope of the confusion matrix. This may be because interference from the open strings caused greater interaction between peaks, which would have distorted the frequency spectrum and created errors during peak picking.

When the tests were first run on a PCP library that contained only the chords being tested, the accuracy was recorded at 98%. When the PCP library was subsequently extended to contain over 300 chords, the accuracy dropped to 94%. This demonstrates the disadvantage of a non-learnt approach: chords with extra or missing notes are recognised incorrectly, and as the number of possible chord combinations is so large, a slightly incorrect chord can easily appear to be a completely different chord.

Naive Bayes

Classification using Naive Bayes achieved accuracies of 99% with a Multinomial model and 98% with a Multivariate model, as shown in Table 14. This was an excellent result, as the accuracies were close to perfect. It can be observed from the confusion matrix, in appendix C, that both models failed on the same chords. Both incorrectly recognised a C as a C Major7 on two occasions, as well as a G Minor as a G 5.

The performance of Naive Bayes could have been improved by applying some feature selection to the conditional probabilities. This would help refine the training data, as erroneous notes in the training data would be ignored. The feature selection could come in the form of mutual information or the Chi square statistic. Another method of improving performance would be to extend the training set. Currently, the training is performed
on 15 examples of each chord. Increasing this value may well increase performance; however, as the accuracy is already at 99%, it may make very little difference.

Table 14: Classification Accuracy for Common Chords

Method                          Accuracy
Nearest Neighbour - Cosine      93.3%
Nearest Neighbour - Hamming     94%
Nearest Neighbour - Euclidean   94%
Multinomial                     98.75%
Multivariate                    99.25%

Complex Four and Five Note Chords

In order to test the robustness of the system, tests had to be carried out on more complex chords. Up to this point, the only four note chords tested were the Major 7ths. Although these are four note chords, the shape played produces a very clear and distinctive sound, so feature extraction was very simple. The complex chords introduced were the four note 6th chords and the five note 6/9th chords. These chords were chosen as their shapes are relatively hard to play, which should introduce errors relating to missing notes. These chords are also played very close to the chords currently in the library, so some confusion is possible.

The addition of the 6th and 6/9th chords had a detrimental effect on the accuracy scores. The accuracy of Nearest Neighbour fell to 80.75% for cosine distance and 79.11% for Hamming and Euclidean distance, as shown in Table 15. This was a disappointing result, as the overall accuracy dropped by a significant 15%. On studying the confusion matrix, it is clear that the 6th chords were recorded poorly, as many only achieved around 12 out of 25 correct examples. This split suggests that either the testing set, which contained 10 examples, or the training set, which contained 15 examples, was of a poorer quality than the other. The performance of the 6th chords is also disappointing when compared to the performance of the 6/9th chords,
which generally scored 18 out of 25. This result showed that when the recordings were of a sufficient quality, the method could still achieve acceptable levels of accuracy.

The accuracy of Naive Bayes also fell, with Multinomial scoring 83.47% and Multivariate scoring 84.5%, as shown in Table 15. Again, as with Nearest Neighbour, this was a significant drop in accuracy. Having studied the confusion matrix, it was also interesting that no real pattern could be found in the incorrect recognitions. The 6th chords were again those that failed most often, although the disparity was not as profound as with Nearest Neighbour: the 6ths were successful on average 5 out of 10 times, compared to 7 out of 10 times for the 6/9ths. The fact that Naive Bayes remained around 5% more accurate than Nearest Neighbour reinforced the advantage that a learnt approach has over a non-learnt approach.

Table 15: Classification Accuracy for Complex Chords

Method                          Accuracy
Nearest Neighbour - Cosine      80.7%
Nearest Neighbour - Hamming     79.11%
Nearest Neighbour - Euclidean   79.11%
Multinomial                     83.47%
Multivariate                    84.5%

Chords on Acoustic Guitar

The pitch class profile classification methods were also tested on chords recorded on an acoustic guitar. The performance of these methods was expected to be lower, as the system had been developed from the beginning to perform on an electric guitar. However, it is still a reasonable test of the robustness of the system. Using Nearest Neighbour, an accuracy of 60% was achieved for all distance measures, whereas using Naive Bayes, an accuracy of 77% was achieved with both the Multinomial and Multivariate models, as shown in Table 16. The confusion matrix produced is in appendix D.

With Nearest Neighbour, it can be observed that where Major chords are recognised
almost all the time, the Minor chords are very poorly recognised, which could be due to one of the higher strings being out of tune when the recordings were made. There were many occasions where Minor chords were recognised as chords a tone away, which suggests that one or two strings were out of tune. There were also some chords, namely F Major, F Major and C Major, which were recognised as chords outside the scope of the confusion matrix. This implies that there were either additional peaks that suggested a four or five note chord, or a missing peak that suggested a fifth chord. These results help to show the advantages of using a learnt approach, as the chords with missing or extra peaks were still classified correctly.

Table 16: Classification Accuracy for Chords on the Acoustic Guitar

Method                          Accuracy
Nearest Neighbour - Cosine      60%
Nearest Neighbour - Hamming     60%
Nearest Neighbour - Euclidean   60%
Multinomial                     77.5%
Multivariate                    77.5%

6. Development of a Real-Time Application

After testing had been completed, a real time system was developed in order to demonstrate the methods implemented in a real world situation. A Graphical User Interface (GUI) was implemented in MATLAB, which could take a live recording of a chord and process it in order to display the chord played back to the user. The GUI was developed in MATLAB because all the implementations of the classification techniques had already been written in MATLAB, therefore requiring little modification.

6.1. System Structure

The system requires several parameters at launch. Firstly, a PCP library, which is used for Nearest Neighbour classification; this is a matrix that contains all the possible chord examples. Secondly, a set of terms; this is a cell array that contains the names of all the chords in the PCP library. Lastly, the training PCPs, which contain all the training sets for the chords.

The first stage in the system is the initialisation process. Here, all input parameters are assigned to handles, which means they can be accessed by the rest of the functions in the GUI. The conditional probabilities for the Multinomial and Multivariate models are also calculated using the training PCPs. These are returned to handles, and the system is ready to be used.

The next stage relies on the user pressing the start recording button. When this is pressed, the getrecording function is called. This initialises an audiorecorder object and begins recording 7 seconds of data. The audio is recorded in stereo, as this is required when recording through a USB interface, and sampled at the same rate used throughout the testing stage. The data is subsequently extracted from the audiorecorder object, converted to mono, and returned to the system.

The audio data is then segmented in order to extract the chord from the recorded audio. The segmenter is the same as the one used during testing and returns the positions of the start and end of the chord. At this point, the segmented chord is plotted in the time domain. The chord is then sent to the getpcp function. The getpcp function first applies a hamming window to the chord, and then applies the fft in order to extract the frequency spectrum. The peaks are then extracted, and from these the PCP is formed. This function also returns the frequency spectrum in addition to the peaks.

The next stage is matching. The matching method used depends on the user's choice. The matching function takes the PCP and returns the name of the chord match, the harmonic of the chord, as well as the PCPs of the top 5 matches. The chord name and harmonic are sent to the chord text box, and the PCPs are sent to a table. The final stage is to play the audio back to the user. This process repeats when the start
recording button is pressed again.

User Interface

Figure 6 shows the layout of the user interface after an A Major chord has been played.

Figure 6: User interface of application

The following are the elements contained in the UI:

1. Classification method - Allows the user to select the type of classification they wish to use.
2. Nearest neighbour similarity measure - A drop down menu that contains the different similarity measures available for Nearest Neighbour classification.
3. Start recording - The user presses this to start the recording. The button turns red whilst the recording is in progress.
4. Play recording - Plays back the chord currently being held by the system. Plays nothing if there is no chord.
5. Peak threshold - Allows the user to alter the peak threshold, which is the amplitude at which values in the frequency spectrum are zeroed. If it is obvious from the frequency plot that low peaks are being missed then the user should lower this value. The default is 30.

6. Onset frame size - Changes the frame size used in the segmenter. This should be changed if a "chord not found" error is being returned.
7. Chord name - The name of the chord recognised, as well as its harmonic.
8. Time plot - The time domain signal of the segmented chord.
9. Frequency plot - The frequency plot of the recognised chord, with the detected peaks indicated.
10. PCP table - Contains the PCP of the chord together with the top 5 matches and their names.

7. Discussion and Future Work

7.1. Discussion

Overall, the methods investigated during the project all worked well. The accuracy achieved for single notes using the mean distance method was rather disappointing. However, the real purpose of this part of the project was to learn how to extract peaks from a frequency spectrum, which would be needed to continue with the project. This was achieved, so the time was well spent. Although two note recognition managed to achieve 90% accuracy, the result was ultimately disappointing, as two note recognition with PCPs produced near 100% accuracy. Nevertheless, the techniques applied during the feature extraction process provided a solid base for the PCP feature extraction, and with hindsight, the time spent on this method may have been better spent improving the PCP methods.

The PCP methods were the greatest success of the project. Despite their relative simplicity, the methods were able to achieve extremely high accuracy for the standard chords, and a decent accuracy for complex chords. It was interesting to note that during the Nearest Neighbour tests, Hamming distance and Euclidean distance gave identical results. It was expected that Hamming distance and cosine distance would be the same throughout,

and that Euclidean distance would be the poor performer, but it proved to be the opposite. However, given the low number of test items, this difference is almost certainly not significant. It was pleasing that the accuracy of Nearest Neighbour remained very high even when the chord library was extended. This meant that a reasonable comparison could be made between the learnt and non-learnt approaches. Classification with Naive Bayes was consistently the highest performer, always outperforming Nearest Neighbour by around 5%. The Multinomial and Multivariate models were almost identical in performance. An interesting extension of Naive Bayes would have been to apply some feature selection during classification, as this is generally a requirement of a Multivariate model.

One of the main objectives of this project was to compare the effectiveness of learnt and non-learnt approaches to recognition. A non-learnt approach has the capacity to recognise any chord without the need for training, whereas a learnt approach requires training. In terms of pure accuracy, the learnt Bayesian approach was the strongest, at an almost perfect 99% for common chords. However, a large amount of training data would need to be acquired to apply this approach to real-time recognition, and it would be hard to apply to other instruments, as a whole other set of training examples would be needed. Therefore it could be argued that Nearest Neighbour would be the correct method choice, as its accuracy was still high enough for use in a real-time system. It also has the advantage of being more flexible to the expansion of the chord library, as only an example PCP is needed for a chord to be recognised. In addition, it can easily be applied to other instruments, as the alterations required to accommodate a new instrument are needed only in the feature extraction process.
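To make the learnt approach concrete, a Multinomial Naive Bayes classifier over PCP bins can be sketched as below. This is a minimal illustrative Python sketch, not the thesis's MATLAB implementation: it treats the 12 PCP bin energies as soft counts and applies Laplace smoothing, and all function names here are assumptions.

```python
import numpy as np

def train_multinomial_nb(train_pcps, labels, n_classes):
    """Minimal Multinomial Naive Bayes over 12-bin PCPs.
    train_pcps: (n_examples, 12) array of non-negative PCP vectors.
    Returns log priors and per-class log bin probabilities."""
    log_prior = np.zeros(n_classes)
    log_cond = np.zeros((n_classes, 12))
    for c in range(n_classes):
        X = train_pcps[labels == c]
        log_prior[c] = np.log(len(X) / len(train_pcps))
        counts = X.sum(axis=0) + 1.0  # Laplace smoothing
        log_cond[c] = np.log(counts / counts.sum())
    return log_prior, log_cond

def classify_nb(pcp, log_prior, log_cond):
    # Score each class: log P(c) + sum_b pcp[b] * log P(bin b | c),
    # then pick the highest-scoring class.
    return int(np.argmax(log_prior + log_cond @ pcp))
```

With a handful of C major (bins 0, 4, 7) and A major (bins 9, 1, 4) training examples, the classifier separates the two templates.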
However, a large chord library presents the problem of slow recognition in real time, as hundreds of distances have to be calculated before being sorted. This problem could be mitigated by writing more efficient code, for example using vectorisation in MATLAB. The demonstration developed was a success, as it was easy to use and performed very quickly in real time. It was, however, a MATLAB interface, and therefore not portable.
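The vectorisation suggested above replaces the per-chord distance loop with whole-library matrix operations. Here is a hedged NumPy sketch of the idea (MATLAB's matrix arithmetic is analogous); the names `nearest_chords` and `cosine_distances` are illustrative, not the project's actual functions.

```python
import numpy as np

def nearest_chords(pcp, library, names, k=5):
    """Rank a chord library against one PCP without an explicit loop.
    library: (n_chords, 12) matrix of example PCPs."""
    # Euclidean distance from the query PCP to every library row at once.
    dists = np.linalg.norm(library - pcp, axis=1)
    order = np.argsort(dists)[:k]
    return [names[i] for i in order]

def cosine_distances(pcp, library):
    # 1 - cosine similarity, computed for all library rows in one shot.
    num = library @ pcp
    den = np.linalg.norm(library, axis=1) * np.linalg.norm(pcp)
    return 1.0 - num / den
```

For a small library of disjoint templates, the query's own template comes back first under either measure.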

7.2. Future Work

In the future, it would be desirable to produce some real-world applications for the system. One such application would be a teaching tool, which would communicate to the user which chord to play and inform them whether the chord was played correctly. This could be extended to instruments beyond the guitar and could prove a very useful tool for self-teaching musicians. It could also prove useful to implement the system on mobile devices, giving musicians portable access to chord recognition, as there are currently almost no applications providing this kind of service. A mobile application was investigated during this project, but the tools for recording pulse code modulated audio on an Android phone are poorly documented, and the lack of an Apple device meant that an iPhone application was not possible.

7.3. Summary

In terms of the project proposal, the aims were to explore the possibility of recognising three note chords and to develop a MATLAB interface that would demonstrate this recognition. In reality, these aims were not only met but exceeded, as the system developed is able to recognise chords with any number of notes, using both a template and a probabilistic approach. Another aim, to test the methods on both electric and acoustic guitars, was also successfully achieved. In terms of the project roadmap, everything was completed in the correct time period; however, with hindsight, less time spent working on the findpeaks function would have been beneficial, as it was replaced by the peak picking function halfway through the project. Overall this was a good, enjoyable and successful project which produced many interesting results, and given time it would be fun to develop the ideas explored further.


A. Single Note and Two Note Recognition Confusion Matrices

B. Nearest Neighbour Confusion Matrix

C. Naive Bayes Confusion Matrix

D. Nearest Neighbour and Naive Bayes for Acoustic Chords

COMPUTATIONAL RHYTHM AND BEAT ANALYSIS Nicholas Berkner. University of Rochester COMPUTATIONAL RHYTHM AND BEAT ANALYSIS Nicholas Berkner University of Rochester ABSTRACT One of the most important applications in the field of music information processing is beat finding. Humans have

More information

Advanced audio analysis. Martin Gasser

Advanced audio analysis. Martin Gasser Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high

More information

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping Structure of Speech Physical acoustics Time-domain representation Frequency domain representation Sound shaping Speech acoustics Source-Filter Theory Speech Source characteristics Speech Filter characteristics

More information

DSP First. Laboratory Exercise #11. Extracting Frequencies of Musical Tones

DSP First. Laboratory Exercise #11. Extracting Frequencies of Musical Tones DSP First Laboratory Exercise #11 Extracting Frequencies of Musical Tones This lab is built around a single project that involves the implementation of a system for automatically writing a musical score

More information

DIGITAL FILTERS. !! Finite Impulse Response (FIR) !! Infinite Impulse Response (IIR) !! Background. !! Matlab functions AGC DSP AGC DSP

DIGITAL FILTERS. !! Finite Impulse Response (FIR) !! Infinite Impulse Response (IIR) !! Background. !! Matlab functions AGC DSP AGC DSP DIGITAL FILTERS!! Finite Impulse Response (FIR)!! Infinite Impulse Response (IIR)!! Background!! Matlab functions 1!! Only the magnitude approximation problem!! Four basic types of ideal filters with magnitude

More information

Fourier Signal Analysis

Fourier Signal Analysis Part 1B Experimental Engineering Integrated Coursework Location: Baker Building South Wing Mechanics Lab Experiment A4 Signal Processing Fourier Signal Analysis Please bring the lab sheet from 1A experiment

More information

INTERNATIONAL BACCALAUREATE PHYSICS EXTENDED ESSAY

INTERNATIONAL BACCALAUREATE PHYSICS EXTENDED ESSAY INTERNATIONAL BACCALAUREATE PHYSICS EXTENDED ESSAY Investigation of sounds produced by stringed instruments Word count: 2922 Abstract This extended essay is about sound produced by stringed instruments,

More information

MUSIC THEORY GLOSSARY

MUSIC THEORY GLOSSARY MUSIC THEORY GLOSSARY Accelerando Is a term used for gradually accelerating or getting faster as you play a piece of music. Allegro Is a term used to describe a tempo that is at a lively speed. Andante

More information

BEAT DETECTION BY DYNAMIC PROGRAMMING. Racquel Ivy Awuor

BEAT DETECTION BY DYNAMIC PROGRAMMING. Racquel Ivy Awuor BEAT DETECTION BY DYNAMIC PROGRAMMING Racquel Ivy Awuor University of Rochester Department of Electrical and Computer Engineering Rochester, NY 14627 rawuor@ur.rochester.edu ABSTRACT A beat is a salient

More information

ROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE

ROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE - @ Ramon E Prieto et al Robust Pitch Tracking ROUST PITCH TRACKIN USIN LINEAR RERESSION OF THE PHASE Ramon E Prieto, Sora Kim 2 Electrical Engineering Department, Stanford University, rprieto@stanfordedu

More information

RAM Analytical Skills Introductory Theory Primer Part 1: Intervals Part 2: Scales and Keys Part 3: Forming Chords Within Keys Part 4: Voice-leading

RAM Analytical Skills Introductory Theory Primer Part 1: Intervals Part 2: Scales and Keys Part 3: Forming Chords Within Keys Part 4: Voice-leading RAM Analytical Skills Introductory Theory Primer Part 1: Intervals Part 2: Scales and Keys Part 3: Forming Chords Within Keys Part 4: Voice-leading This is intended to support you in checking you have

More information

How to Utilize a Windowing Technique for Accurate DFT

How to Utilize a Windowing Technique for Accurate DFT How to Utilize a Windowing Technique for Accurate DFT Product Version IC 6.1.5 and MMSIM 12.1 December 6, 2013 By Michael Womac Copyright Statement 2013 Cadence Design Systems, Inc. All rights reserved

More information

Extraction of tacho information from a vibration signal for improved synchronous averaging

Extraction of tacho information from a vibration signal for improved synchronous averaging Proceedings of ACOUSTICS 2009 23-25 November 2009, Adelaide, Australia Extraction of tacho information from a vibration signal for improved synchronous averaging Michael D Coats, Nader Sawalhi and R.B.

More information

Statistical Pulse Measurements using USB Power Sensors

Statistical Pulse Measurements using USB Power Sensors Statistical Pulse Measurements using USB Power Sensors Today s modern USB Power Sensors are capable of many advanced power measurements. These Power Sensors are capable of demodulating the signal and processing

More information

Beginner Guitar Theory: The Essentials

Beginner Guitar Theory: The Essentials Beginner Guitar Theory: The Essentials By: Kevin Depew For: RLG Members Beginner Guitar Theory - The Essentials Relax and Learn Guitar s theory of learning guitar: There are 2 sets of skills: Physical

More information

Connections Power Jack This piano can be powered by current from a standard household wall outlet by using the specified AC adaptor. The power jack is located on the rear panel of the piano body. Make

More information

Synthesis Techniques. Juan P Bello

Synthesis Techniques. Juan P Bello Synthesis Techniques Juan P Bello Synthesis It implies the artificial construction of a complex body by combining its elements. Complex body: acoustic signal (sound) Elements: parameters and/or basic signals

More information

Main Screen Description

Main Screen Description Dear User: Thank you for purchasing the istrobosoft tuning app for your mobile device. We hope you enjoy this software and its feature-set as we are constantly expanding its capability and stability. With

More information

PART I: The questions in Part I refer to the aliasing portion of the procedure as outlined in the lab manual.

PART I: The questions in Part I refer to the aliasing portion of the procedure as outlined in the lab manual. Lab. #1 Signal Processing & Spectral Analysis Name: Date: Section / Group: NOTE: To help you correctly answer many of the following questions, it may be useful to actually run the cases outlined in the

More information

Music I. Marking Period 1. Marking Period 3

Music I. Marking Period 1. Marking Period 3 Week Marking Period 1 Week Marking Period 3 1 Intro. Piano, Guitar, Theory 11 Intervals Major & Minor 2 Intro. Piano, Guitar, Theory 12 Intervals Major, Minor, & Augmented 3 Music Theory meter, dots, mapping,

More information

Discrete Fourier Transform

Discrete Fourier Transform 6 The Discrete Fourier Transform Lab Objective: The analysis of periodic functions has many applications in pure and applied mathematics, especially in settings dealing with sound waves. The Fourier transform

More information

Design of FIR Filters

Design of FIR Filters Design of FIR Filters Elena Punskaya www-sigproc.eng.cam.ac.uk/~op205 Some material adapted from courses by Prof. Simon Godsill, Dr. Arnaud Doucet, Dr. Malcolm Macleod and Prof. Peter Rayner 1 FIR as a

More information

Definition of Basic Terms:

Definition of Basic Terms: Definition of Basic Terms: Temperament: A system of tuning where intervals are altered from those that are acoustically pure (Harnsberger, 1996, p. 130) A temperament is any plan that describes the adjustments

More information

Automatic Amplitude Estimation Strategies for CBM Applications

Automatic Amplitude Estimation Strategies for CBM Applications 18th World Conference on Nondestructive Testing, 16-20 April 2012, Durban, South Africa Automatic Amplitude Estimation Strategies for CBM Applications Thomas L LAGÖ Tech Fuzion, P.O. Box 971, Fayetteville,

More information

The Discrete Fourier Transform. Claudia Feregrino-Uribe, Alicia Morales-Reyes Original material: Dr. René Cumplido

The Discrete Fourier Transform. Claudia Feregrino-Uribe, Alicia Morales-Reyes Original material: Dr. René Cumplido The Discrete Fourier Transform Claudia Feregrino-Uribe, Alicia Morales-Reyes Original material: Dr. René Cumplido CCC-INAOE Autumn 2015 The Discrete Fourier Transform Fourier analysis is a family of mathematical

More information

Speech and Music Discrimination based on Signal Modulation Spectrum.

Speech and Music Discrimination based on Signal Modulation Spectrum. Speech and Music Discrimination based on Signal Modulation Spectrum. Pavel Balabko June 24, 1999 1 Introduction. This work is devoted to the problem of automatic speech and music discrimination. As we

More information

Topic 6. The Digital Fourier Transform. (Based, in part, on The Scientist and Engineer's Guide to Digital Signal Processing by Steven Smith)

Topic 6. The Digital Fourier Transform. (Based, in part, on The Scientist and Engineer's Guide to Digital Signal Processing by Steven Smith) Topic 6 The Digital Fourier Transform (Based, in part, on The Scientist and Engineer's Guide to Digital Signal Processing by Steven Smith) 10 20 30 40 50 60 70 80 90 100 0-1 -0.8-0.6-0.4-0.2 0 0.2 0.4

More information

PHYSICS AND THE GUITAR JORDY NETZEL LAKEHEAD UNIVERSITY

PHYSICS AND THE GUITAR JORDY NETZEL LAKEHEAD UNIVERSITY PHYSICS AND THE GUITAR JORDY NETZEL LAKEHEAD UNIVERSITY 2 PHYSICS & THE GUITAR TYPE THE DOCUMENT TITLE Wave Mechanics Starting with wave mechanics, or more specifically standing waves, it follows then

More information

Measurement of RMS values of non-coherently sampled signals. Martin Novotny 1, Milos Sedlacek 2

Measurement of RMS values of non-coherently sampled signals. Martin Novotny 1, Milos Sedlacek 2 Measurement of values of non-coherently sampled signals Martin ovotny, Milos Sedlacek, Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Measurement Technicka, CZ-667 Prague,

More information

The Fundamentals of Mixed Signal Testing

The Fundamentals of Mixed Signal Testing The Fundamentals of Mixed Signal Testing Course Information The Fundamentals of Mixed Signal Testing course is designed to provide the foundation of knowledge that is required for testing modern mixed

More information

FFT 1 /n octave analysis wavelet

FFT 1 /n octave analysis wavelet 06/16 For most acoustic examinations, a simple sound level analysis is insufficient, as not only the overall sound pressure level, but also the frequency-dependent distribution of the level has a significant

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Sinusoids and DSP notation George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 38 Table of Contents I 1 Time and Frequency 2 Sinusoids and Phasors G. Tzanetakis

More information