SONG RETRIEVAL SYSTEM USING HIDDEN MARKOV MODELS

Akshay Chandrashekaran (akshayc@cmu.edu), Anoop Ramakrishna (anoopr@andrew.cmu.edu), Abhishek Jain (ajain2@andrew.cmu.edu), Ge Yang (younger@cmu.edu), Nidhi Kohli (nkohli@andrew.cmu.edu), R Swaminathan (rswamin1@andrew.cmu.edu), Tejas Chopra (tchopra@andrew.cmu.edu), Yiting Xie (yitingx@andrew.cmu.edu)

Abstract

This project proposes a method for retrieving the song that corresponds to a recorded snippet of it. Each song is converted into a string of music phonemes, quasi-stationary units of MFCC feature vectors modeled with Hidden Markov Models, and the resulting symbol sequence is stored; the same procedure is applied to the snippet. The snippet sequence is then compared against each song sequence with a sliding window, using the Levenshtein edit distance as the score metric, which effectively reduces retrieval to a string search problem. The use of HMMs adds robustness against noise. Applications include user-driven content search and song retrieval services such as Shazam.

1 Introduction

Music identification is the task of determining the identity of a song from a partial recording supplied by a user. Such systems can be used by content distribution networks such as Google and YouTube to identify copyrighted audio within their platforms. Identification is difficult for two reasons. First, the snippet may be contaminated by many components of noise. Second, the system must be efficient in memory and time, which is usually difficult given the

sheer size and number of songs in a typical database.

Earlier approaches to song retrieval were based on hash-search algorithms that require an exact match between the snippet and the song; in the presence of noise, such systems yield incorrect results. Another approach decodes Mel-frequency cepstral coefficient (MFCC) [1] features over the audio stream into a sequence of audio events, with the matching driven by Hidden Markov Models (HMMs) [2]. That system, however, looks only for atomic sound sequences of a fixed length so that the complexity can be reduced [3]. A more recent paper by Weinstein et al. [4] uses HMMs for song representation and then reduces the space complexity with weighted finite-state transducers (WFSTs). Our method follows steps similar to [4] but diverges in the final stages of database generation and song retrieval.

As a baseline, the task was initially performed using direct spatial correlation, correlation of the Fourier transforms of the songs and snippets, and correlation of their short-time Fourier transforms. Though these methods are simple to understand and implement, they are highly inefficient in both storage and time complexity. Hidden Markov Models whose observations are Gaussians or mixtures of Gaussians are an effective statistical model for dynamic causal data, and hence they are used in the proposed song retrieval system.

2 Music identification

Unlike speech, which has a set of well-defined phonemes, music is essentially a non-stationary stochastic process with time-varying joint PDFs: being polyphonic, it is composed of several random variables such as amplitude, frequency, and phase. Since non-stationary processes have varying means and variances, we break each song into quasi-stationary segments and represent the songs by a set of music phonemes.

2.1 Acoustic Modeling

Acoustic modeling involves learning a joint set of music phonemes and finding the set of phonemes that best represents each song. First, the MFCC features of each song are computed, using 100 ms windows with a hop size of 25 ms over the audio stream. The first 12 coefficients, the frame energy, and their first and second derivatives are retained, resulting in a 39-dimensional feature vector.

2.2 Model Initialization

2.2.1 Segmentation

To find the initial segmentation, a window is slid over the feature stream and a diagonal-covariance Gaussian is fitted to each window. The KL divergence is then calculated between adjacent windows. For two $d$-dimensional Gaussians $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$ it is

$$D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1) - d + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right],$$

which, for the diagonal covariances used here, reduces to a sum of per-dimension terms. If the KL divergence between adjacent windows is above an empirically determined threshold, the song is segmented at that point. This is done over all songs to generate a set of segments, as shown in Figure 1.

[Fig. 1: Segmentation in songs]
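To make the front end concrete, here is a minimal sketch of the feature extraction described in Section 2.1. The report does not name a toolkit, so librosa is an assumption, as are the 16 kHz sample rate, the use of the 0th cepstral coefficient as the energy term, and the function name extract_features.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000):
    """Return a (n_frames, 39) matrix: 13 MFCCs plus deltas and delta-deltas."""
    y, _ = librosa.load(path, sr=sr)
    n_fft = int(0.100 * sr)        # 100 ms analysis window, per Section 2.1
    hop = int(0.025 * sr)          # 25 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)            # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)   # second derivatives
    return np.vstack([mfcc, d1, d2]).T
```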

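The segmentation of Section 2.2.1 can likewise be sketched in a few lines. The window length (in frames) and the KL threshold below are placeholders, since the report only says the threshold was determined empirically; the divergence is symmetrized here, a common choice, though the report does not say which direction it used. kl_diag_gauss implements the diagonal-covariance reduction of the formula above.

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL(N(mu1, var1) || N(mu2, var2)) for diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def segment_boundaries(feats, win=40, threshold=50.0):
    """Mark a boundary wherever the symmetric KL divergence between the
    Gaussians fitted to adjacent windows exceeds the threshold."""
    bounds = []
    for t in range(win, len(feats) - win, win // 2):
        left, right = feats[t - win:t], feats[t:t + win]
        mu1, var1 = left.mean(0), left.var(0) + 1e-6   # variance floor
        mu2, var2 = right.mean(0), right.var(0) + 1e-6
        d = (kl_diag_gauss(mu1, var1, mu2, var2)
             + kl_diag_gauss(mu2, var2, mu1, var1))
        if d > threshold:
            bounds.append(t)
    return bounds
```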
2.2.2 Clustering

The means of all segments are then clustered with K-means. The number of clusters is predetermined and taken as a power of 2, and a corresponding number of HMMs is trained on the segments. These HMMs are not ergodic: each state may only self-transition or move to the next adjacent state, and the observations of each state are independent Gaussians with diagonal covariance.

2.2.3 Phoneme Generation

Next, all segments are decoded against the HMMs, and each segment is reassigned to the HMM under which it attains the maximum log likelihood, as shown in Figure 2. The HMMs are then retrained on the reassigned segments. This method thus generates a set of quasi-stationary music phonemes that can represent the database.

[Fig. 2: Creation of HMMs after clustering]

2.2.4 State Flow Model

At this point, all segments are assigned to different HMMs. The next step is to combine these HMMs into a single large HMM, called the State Flow HMM, which has parallel branches: each branch is the HMM of one cluster, and the branch starts and ends are tied together via non-emitting states, as shown in Figure 3.

[Fig. 3: State Flow Model HMM]
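A possible realization of Section 2.2.2, assuming scikit-learn for K-means and hmmlearn for the Gaussian HMMs (the report does not name a toolkit). The 3-state topology, 32 clusters, and iteration count are illustrative only; freezing the start and transition parameters (params="mc") keeps the left-to-right structure intact during EM.

```python
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn import hmm

def left_to_right_hmm(n_states=3):
    """Non-ergodic HMM: each state may only self-loop or advance one state."""
    m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                        init_params="mc", params="mc", n_iter=20)
    m.startprob_ = np.r_[1.0, np.zeros(n_states - 1)]
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5   # last state self-loops
    m.transmat_ = trans
    return m

def train_cluster_hmms(segments, n_clusters=32):
    """K-means on segment means, then one HMM per cluster (Section 2.2.2)."""
    means = np.array([s.mean(0) for s in segments])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(means)
    models = []
    for k in range(n_clusters):
        members = [s for s, l in zip(segments, labels) if l == k]
        m = left_to_right_hmm()
        m.fit(np.vstack(members), lengths=[len(s) for s in members])
        models.append(m)
    return models
```

The reassign-and-retrain loop of Section 2.2.3 would then rescore each segment with every model (via model.score) and refit; that loop is omitted here for brevity.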

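hmmlearn has no non-emitting states, so the parallel-branch State Flow HMM of Section 2.2.4 cannot be built in it directly. The sketch below approximates the decode used for song representation (Section 2.2.5): reusing segment_boundaries and the models list from the earlier sketches, it labels each chunk with its best-scoring cluster HMM. This is a simplification of the single-pass decode through the State Flow HMM, not an exact reproduction.

```python
import numpy as np

def transcribe(feats, models, win=40):
    """Chop the feature stream at detected boundaries and label each chunk
    with the index of the highest-likelihood cluster HMM, producing the
    song's (or snippet's) phoneme string."""
    bounds = [0] + segment_boundaries(feats, win=win) + [len(feats)]
    symbols = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b - a < 2:        # skip degenerate chunks
            continue
        scores = [m.score(feats[a:b]) for m in models]
        symbols.append(int(np.argmax(scores)))
    return symbols           # e.g. [4, 17, 17, 3, ...]
```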
2.2.5 Song Representation

After generating the state flow model, the MFCC feature matrix of each song is passed through it, and the path, i.e. the sequence of cluster HMMs traversed from start to end, is recovered. Thus an uncompressed song in WAV format, typically an array of nearly a million bytes, is represented by a short string of symbols.

2.2.6 Song Retrieval

Each song is now represented as a string of alphanumeric characters that can be saved to disk; building this database is a one-time job. When a snippet arrives from a user, it is run through the State Flow HMM to obtain its own transcription. The Levenshtein edit distance [5] is then computed between the snippet transcription and a sliding window over each song transcription, and the song with the lowest edit distance is returned as the best match.
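The retrieval step reduces to a textbook dynamic-programming edit distance plus a sliding window. Below is a minimal sketch, assuming the database maps song names to the symbol strings produced by the transcription step; an exhaustive scan like this is fine for illustration but would need a faster string search at production scale.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance [5]."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def best_match(snippet, database):
    """Slide the snippet over each song string; return the song whose best
    window has the lowest edit distance (Section 2.2.6)."""
    n = len(snippet)
    best_dist, best_song = float("inf"), None
    for name, song in database.items():
        for start in range(max(1, len(song) - n + 1)):
            d = levenshtein(snippet, song[start:start + n])
            if d < best_dist:
                best_dist, best_song = d, name
    return best_song
```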

3 Results

Method                Noiseless Snippets   Noisy Snippets
Spatial Correlation   0.72                 0.51
STFT Correlation      0.83                 0.55
HMM-based Retrieval   0.92                 0.60

Table 1: Accuracy of the three methods on noiseless and noisy snippets.

The baseline applied direct correlation between the snippet and the songs. Although its results were not poor, the method was computationally expensive and consumed a great deal of memory and time. The short-time Fourier transform refinement obtained slightly better results, but its complexity remained very large. Our algorithm improved on both methods in efficiency, speed, memory footprint, and accuracy; the results are tabulated in Table 1.

4 Conclusion

The method matches noiseless snippets to songs with very high accuracy and does not rely on an exact match between snippet and song. The music phone sequences have variable length, depending on the clustering and segmentation performed for each song, and the project succeeded in transcribing songs into such music units. The use of HMMs adds robustness to the matching procedure, and the string search method accommodates changes in the expected snippet string caused by noise. However, further improvement is possible in the case of noisy signals. Although training takes longer with this method, that cost is offset by the speed of online retrieval.

5 Future Work

Although the algorithm achieves very high accuracy on noiseless snippets, its performance in the presence of noise is lacking. Its temporal efficiency, while better than that of the other methods, can also be improved, for example by implementing more efficient and accurate string search algorithms.

6 Acknowledgment

We sincerely thank Professor Bhiksha Raj, LTI, Carnegie Mellon University, for his critical suggestions and help. His constant motivation was the driving factor that enabled us to complete the project.

7 References

[1] E. Weinstein and P. Moreno, "Music identification with weighted finite-state transducers," International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, 2007.
[2] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, 1989.
[3] E. Batlle, J. Masip, and E. Guaus, "Automatic song identification in noisy broadcast audio," IASTED International Conference on Signal and Image Processing, Kauai, Hawaii, 2002.
[4] M. Mohri, P. Moreno, and E. Weinstein, "Efficient and Robust Music Identification with Weighted Finite-State Transducers," IEEE Transactions on Audio, Speech, and Language Processing, 2009.
[5] V. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, 1966.