Automatic Transcription of Monophonic Audio to MIDI


Jiří Vass (1) and Hadas Ofir (2)

(1) Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Measurement, vassj@fel.cvut.cz
(2) Technion - Israel Institute of Technology, Department of Electrical Engineering, Signal and Image Processing Laboratory, hadaso@siglab.technion.ac.il

Abstract

This paper presents an automatic system for transcribing monophonic audio signals into the MIDI (Musical Instrument Digital Interface) representation. The system combines two separate algorithms to extract the necessary musical information from the audio signal. Detection of the fundamental frequency is based on a pattern recognition method applied to the constant Q spectral transform. Onset detection is achieved by a sequential algorithm that computes a statistical distance measure between two autoregressive models. The results of both algorithms are combined by heuristic rules that eliminate transcription errors in a natural manner. The method is suitable for rapid musical passages, can deal with a variety of musical sounds, and is applicable over a wide range of MIDI frequencies.

1 Introduction

Transcription of music is the task of converting a particular piece of music into a symbolic representation, such as standard musical notation or the MIDI file format. From this point of view, music can be classified as polyphonic or monophonic. The former consists of multiple simultaneously sounding notes, whereas the latter contains only a single note at each time instant, such as a saxophone solo. Monophonic transcription can thus be understood as a special, simpler case of polyphonic transcription, and is considered practically solved [1]. On the other hand, it remains an important case to be treated separately, with much stricter demands on transcription quality (which still appears to be relatively limited for polyphonic transcribers).
Moreover, specific applications of monophonic systems include tools for solo musicians [3], low bit-rate audio coding [4], and monophonic ring tones for cellular phones.

Each musical note can be described by three essential parameters: the fundamental frequency (pitch), the beginning of the note (onset time), and the note duration. For this reason, a transcription system should include both a pitch tracker and an onset detector, although these need not be implemented as two separate blocks. Previous papers tend to describe pitch and onset detection techniques separately, so only a few compact and reliable monophonic transcribers have been published (for example [3]). Furthermore, transcribers based on autocorrelation [4] suffer from the common time-frequency tradeoff and cannot be applied to a wide range of frequencies [2], which is an essential requirement for audio signals. Our solution is therefore based on the Constant Q Transform (CQT), which offers an adjustable frequency range and excellent time-frequency resolution, further improved by an onset detector operating on a sample-by-sample basis.

2 The Transcription System

Figure 1 depicts the basic building blocks of the transcription system.

[Figure 1. Transcription System: Input Signal → Pre-processing → Pitch Detection, Detection of Events, and Estimation of Power → Combining the Results → Notes to MIDI → MIDI File]

The Pre-processing block normalizes the input signal to unit power and inserts silence before the beginning and after the end of the signal. The Pitch Detection block is primarily responsible for tracking the fundamental frequency in the signal, but it also contributes to the determination of the note onsets and offsets, which is the principal task of the Detection of Events block (whose output can conversely affect the

fundamental frequency tracking). Since none of these blocks yields ideal results, the Estimation of Power block is added to provide supportive data for event detection. The Combining the Results block then processes the outputs of the preceding blocks and generates the complete event list, which is finally converted to a MIDI file by the Notes to MIDI block developed by [2].

2.1 Pitch Detection

Since the musical frequencies form a geometric series, it is desirable to represent the signal with a spectral transform corresponding to a filter bank whose center frequencies are spaced exponentially. Such a transform was developed by [5] and is referred to as the Constant Q Transform (CQT). As in the DFT, the frequency range is divided into frequency bins, each represented by a bandpass filter with a center frequency f_k and bandwidth Δf_k. However, the CQT bins are geometrically spaced, which results in variable resolution at different octaves and a constant quality factor Q = f_k / Δf_k. With b being the number of filters per octave, the CQT filter bank is equivalent to a 1/b-th octave filter bank, which shows its relationship to the wavelet transform [2].

The CQT spectral components form a constant pattern when plotted against a logarithmic frequency axis, which is evident from the following equation relating harmonics m and n of a fundamental frequency F_0:

    log(f_m) - log(f_n) = log(f_m / f_n) = log(m F_0 / (n F_0)) = log(m / n)    (1)

In other words, the relative positions of the harmonics are the same for all musical notes; only their absolute positions depend on the fundamental frequency F_0. This property can be employed to determine the fundamental frequency as the maximum of the cross-correlation function between an ideal (theoretical) pattern and the pattern in the actual CQT spectrum [6], as depicted in Fig. 2.

2.2 Detection of Events

The algorithm for detection of acoustic changes (events) is based on [7], and has also been successfully applied to audio signals [8].
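The geometric bin spacing and the cross-correlation pattern matching used by the pitch detector of Sec. 2.1 can be sketched as follows. This is a minimal illustration of Eq. (1), not the authors' implementation; the value b = 24 bins per octave and the binary harmonic pattern are assumptions:

```python
import numpy as np

def cqt_center_freqs(f_min, f_max, b=24):
    """Geometrically spaced CQT center frequencies f_k = f_min * 2**(k/b)
    with b bins per octave; the quality factor Q = f_k / delta_f_k is then
    the same for every bin."""
    n_bins = int(np.floor(b * np.log2(f_max / f_min))) + 1
    return f_min * 2.0 ** (np.arange(n_bins) / b)

def ideal_pattern(n_harmonics, b=24):
    """Binary harmonic pattern on the log-frequency axis: by Eq. (1),
    harmonic m sits round(b * log2(m)) bins above the F0 bin,
    regardless of which note is played."""
    offsets = np.round(b * np.log2(np.arange(1, n_harmonics + 1))).astype(int)
    pattern = np.zeros(offsets[-1] + 1)
    pattern[offsets] = 1.0
    return pattern

def estimate_f0_bin(cqt_magnitudes, pattern):
    """Estimate the F0 bin as the shift maximizing the cross-correlation
    between the measured CQT magnitude spectrum and the ideal pattern [6]."""
    corr = np.correlate(cqt_magnitudes, pattern, mode="valid")
    return int(np.argmax(corr))
```

For example, a synthetic spectrum whose harmonics are placed for an F0 at bin 30 is recovered exactly; the bin index maps back to frequency via f_min * 2**(bin / b).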
A great advantage is the statistical time-domain approach, performed on a sample-by-sample

basis, thus providing very accurate locations of the onset and offset times. The main idea is to model the signal by two autoregressive (AR) models, monitor a suitable distance measure between them, and detect a new event whenever the distance exceeds a specific threshold value. Since the distance measure is the conditional Kullback divergence, the algorithm is commonly referred to as the divergence test.

[Figure 2. Pitch Detection System: Input Signal → Windowing → Constant Q Transform → Cross-correlation with Ideal Pattern (Pattern Recognition) → Peak Picking → Median Filtering → Pitch]

2.3 Estimation of Power

Since the divergence test provides no information about the origin of a detected acoustic event, it is necessary to estimate the signal power in order to reliably distinguish between true and false onsets. This is achieved using a leaky integrator, which computes a recursive estimate of the power [9]. The resulting power signal is then iteratively smoothed to locate its most significant minima, which are subsequently used in the final decision procedure.

2.4 Combining the Results

This block processes the outputs of the preceding sections and applies several heuristic rules to correctly choose the best candidate for the onset time and MIDI frequency of each note. The rules form an eliminative competition between the two sets of candidates obtained by the CQT and the divergence test (shown in Fig. 3).

Rule 1

The first rule can briefly be summarized as: the winner is the nearest. Specifically, the algorithm sequentially processes the candidates from the CQT segmentation and assigns to each of them the nearest candidate from the divergence test.
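The two-model event detector of Sec. 2.2 and the leaky-integrator power estimate of Sec. 2.3 can be sketched as below. This is a loose sketch under stated assumptions: the AR order, window lengths, threshold, and the use of a log ratio of prediction-error variances as the distance are illustrative choices, not the paper's conditional Kullback divergence statistic; the recursion p[n] = a*p[n-1] + (1-a)*x[n]^2 is one common leaky-integrator form.

```python
import numpy as np

def ar_fit(x, order):
    """Least-squares AR(order) fit; returns coefficients and residual variance."""
    X = np.column_stack([x[order - i - 1: len(x) - i - 1] for i in range(order)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a, np.var(y - X @ a)

def divergence_onsets(x, order=4, win=256, hop=64, thresh=2.0):
    """Two-model change detection in the spirit of the divergence test [7]:
    compare a 'global' AR model of the recent past with a 'local' AR model
    of a short window, and flag an event when the log ratio of the global
    model's prediction error on the local window to the local model's own
    error exceeds a threshold."""
    events = []
    for n in range(2 * win, len(x) - win, hop):
        a_g, _ = ar_fit(x[n - 2 * win: n], order)          # model of the past
        _, v_l = ar_fit(x[n: n + win], order)              # model of the present
        seg = x[n: n + win]
        X = np.column_stack([seg[order - i - 1: len(seg) - i - 1]
                             for i in range(order)])
        cross = np.var(seg[order:] - X @ a_g)              # past model on present data
        if np.log(max(cross, 1e-300) / max(v_l, 1e-12)) > thresh:
            events.append(n)
    return events

def leaky_power(x, alpha=0.999):
    """Recursive power estimate p[n] = alpha*p[n-1] + (1-alpha)*x[n]**2
    (the leaky integrator of Sec. 2.3)."""
    p = np.empty(len(x))
    acc = 0.0
    for n, v in enumerate(x):
        acc = alpha * acc + (1 - alpha) * v * v
        p[n] = acc
    return p
```

On a signal that switches from low-level noise to a steady tone, the detector fires near the change point, and the power estimate settles to the signal's mean power.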

Rule 2

The correct candidates from the divergence test are typically located in the vicinity of the power minima preceding the note attacks (see Sec. 2.3). This property therefore serves as the necessary condition of the second rule, which allows additional acceptance of divergence-test candidates rejected by Rule 1. This rule enables detection of successive notes of the same frequency, and hence allows a time segmentation that would not be possible with the CQT approach alone.

Rule 3

The third rule is based on the observation that monophonic signals contain an attack between every two consecutive onsets. In other words, the power must have at least one local maximum between two onsets for them to be considered the beginnings of two separate notes.

[Figure 3. Results of the CQT and the divergence test to be processed by the heuristic rules. Top: signal amplitude [V] with CQT and divergence-test segmentation marks over 1.2-1.8 s. Bottom: MIDI note number (66-76) over the same interval.]
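Rule 1's nearest-candidate assignment, together with the mapping from a detected fundamental frequency to a MIDI note number, can be sketched as follows. The helper names rule1_match and freq_to_midi and the list-of-onset-times interface are hypothetical, introduced only for illustration:

```python
import numpy as np

def freq_to_midi(f0):
    """Nearest MIDI note number for a frequency in Hz (A4 = 440 Hz = note 69)."""
    return int(round(69 + 12 * np.log2(f0 / 440.0)))

def rule1_match(cqt_onsets, div_onsets):
    """Rule 1 ('the winner is the nearest'): walk through the CQT onset
    candidates in order and assign each one the nearest divergence-test
    candidate. Returns (cqt_time, matched_div_time) pairs."""
    div = np.asarray(div_onsets, dtype=float)
    return [(t, float(div[np.argmin(np.abs(div - t))])) for t in cqt_onsets]
```

For instance, with CQT onsets at 1.0 s and 2.0 s and divergence-test candidates at 0.9 s, 1.8 s, and 2.5 s, the pairs (1.0, 0.9) and (2.0, 1.8) result, leaving 2.5 s as a rejected candidate to be reconsidered under Rule 2.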

3 Conclusion

This paper presents a compact system for transcribing monophonic audio signals into the MIDI representation. The time-frequency resolution of the pitch tracker is improved by a sequential onset detector, resulting in a transcription that is accurate both in time and in frequency. Moreover, the new method applies several heuristic rules to eliminate the errors of both the pitch- and onset-detection algorithms. This approach avoids the conventional minimum-note-duration parameter and is therefore especially suitable for transcribing fast musical passages.

References

1. Klapuri, A.: Automatic Transcription of Music. MSc. thesis, Tampere University of Technology, 1998.
2. Cemgil, A.T.: Automated Monophonic Music Transcription (A Wavelet Theoretical Approach). MSc. thesis, Bogazici University, 1995.
3. Bořil, H.: Kytarový MIDI převodník [Guitar MIDI Converter]. MSc. thesis, Czech Technical University, 2003.
4. Bello, J.P., Monti, G., Sandler, M.B.: Automatic Music Transcription and Audio Source Separation. Cybernetics and Systems: An International Journal, 2002.
5. Brown, J.: Calculation of a Constant Q Spectral Transform. Journal of the Acoustical Society of America, Jan. 1991.
6. Brown, J.: Musical Fundamental Frequency Tracking Using a Pattern Recognition Method. Journal of the Acoustical Society of America, Sep. 1992.
7. Basseville, M., Benveniste, A.: Sequential Detection of Abrupt Changes in Spectral Characteristics of Digital Signals. IEEE Transactions on Information Theory, Sep. 1983.
8. Jehan, T.: Musical Signal Parameter Estimation. MSc. thesis, UC Berkeley, 1997.
9. Sovka, P., Pollák, P.: Vybrané metody číslicového zpracování signálů [Selected Methods of Digital Signal Processing]. Czech Technical University, 2003.