REpeating Pattern Extraction Technique (REPET)


REpeating Pattern Extraction Technique (REPET). EECS 352: Machine Perception of Music & Audio. Zafar RAFII, Spring 2012.

Repetition. Repetition is a fundamental element in generating and perceiving structure. Audio example: Propellerheads - History Repeating.

Repetition. Repetitions happen in audio in general: music, repetitive noises, auditory grouping, etc.

Repetition. Repetitions happen in art in general: painting, sculpture, architecture, etc.

Repetition. Repetitions happen in nature in general: animals, plants, objects, etc.

Repetition. Musical pieces are generally characterized by an underlying repeating structure over which varying elements are superimposed. Audio example: Propellerheads - History Repeating.

Repetition. This means there should be patterns that are more or less repeating in time and frequency. [Figure: mixture spectrogram, with high- and low-energy regions.]

Repetition. The (more or less) repeating patterns could be identified using a time-frequency mask. [Figure: time-frequency mask; 1 = repeating bin, 0 = non-repeating bin.]

Repetition. The mask could be applied on the mixture to extract the (more or less) repeating patterns. [Figure: repeating spectrogram, with high- and low-energy regions.]

Repetition. REpeating Pattern Extraction Technique! 1. Identify the repeating period. 2. Model the repeating segment. 3. Extract the repeating structure. A simple music/voice separation method! Repeating structure = musical background; non-repeating structure = vocal foreground.

REPET. [Figure: pipeline overview. Step 1: the beat spectrum b, computed from the mixture spectrogram V of the mixture signal x, reveals the repeating period p. Step 2: the element-wise median over the p-length segments of V gives the repeating segment model S. Step 3: the element-wise min between S and the segments of V gives the repeating spectrogram W, from which the time-frequency mask M is derived.]

Practical Advantages. Not feature-dependent. Does not rely on complex frameworks. Does not require prior training.

Practical Interests. Instrument/vocalist identification. Pitch/melody transcription. Karaoke gaming.

Intellectual Interests. Music understanding. Music perception. Simply based on repetition!

REPET. Parallel with background subtraction in vision: compare frames to estimate a background model.

REPET. Parallel with background subtraction in vision: extract the background from the foreground.

REPET. Parallel with background subtraction in vision: in audio, we also need to identify the repetitions! [Figure: mixture signal waveform.]

REPET. Parallel with background subtraction in vision: in audio, we also need to identify the repetitions! [Figures: vocal foreground and musical background waveforms.]

Repeating Period. We compute the autocorrelations of the rows of the spectrogram to reveal periodicities. [Figures: mixture spectrogram with autocorrelation plots; spectrum and autocorrelation at 1 kHz.]

Repeating Period. We take the mean of the autocorrelations (over the rows) to obtain the beat spectrum. [Figures: mixture spectrogram, autocorrelation plots, and beat spectrum.]
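
In code, the beat spectrum is just a row-wise autocorrelation followed by a mean. A minimal NumPy sketch (not Rafii's reference implementation; the FFT-based autocorrelation and the unbiased normalization are choices of this sketch):

```python
import numpy as np

def beat_spectrum(V):
    """Mean of the per-row autocorrelations of the power spectrogram
    V**2 (V is frequency x time), normalized so that lag 0 equals 1."""
    n = V.shape[1]
    # Row-wise autocorrelation via the FFT, zero-padded to 2n to avoid
    # circular wrap-around (Wiener-Khinchin theorem).
    X = np.fft.rfft(V ** 2, n=2 * n, axis=1)
    acf = np.fft.irfft(X * np.conj(X), axis=1)[:, :n]
    acf /= np.arange(n, 0, -1)  # unbiased: divide by the overlap length
    b = acf.mean(axis=0)        # average over frequency rows
    return b / b[0]
```

For a spectrogram whose frames repeat exactly every p frames, the resulting beat spectrum has a peak of height 1 at lag p.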

Repeating Period. The beat spectrum reveals the repeating period p of the underlying repeating structure. [Figures: mixture signal and beat spectrum, with peaks at multiples of the lag p.]
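
Reading the period off the beat spectrum can be as simple as taking the largest value past lag 0. REPET's actual period finder is more careful (it accounts for the peaks recurring at integer multiples of the period); this argmax version is only a naive stand-in:

```python
import numpy as np

def repeating_period(b, min_lag=1):
    """Pick the repeating period as the lag of the largest beat-spectrum
    value past lag 0, searching only up to half the length so that at
    least two repetitions fit in the excerpt."""
    lags = np.arange(min_lag, len(b) // 2)
    return int(lags[np.argmax(b[lags])])
```

Because np.argmax returns the first maximum, the fundamental period is preferred over its multiples when their peaks are equally high.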

Repeating Segment. The repeating period is then used to segment the mixture spectrogram at period rate. [Figures: mixture spectrogram, segmented spectrogram, and beat spectrum.]

Repeating Segment. The repeating segment model is calculated as the element-wise median of the segments. [Figures: mixture spectrogram, segmented spectrogram, and repeating segment model.]

Repeating Segment. The median helps to derive a smooth repeating segment model, removing outliers. [Figures: mixture spectrogram and repeating segment model, with high- and low-energy regions.]
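
The segmentation-plus-median step can be sketched as follows; unlike REPET proper, this minimal version simply drops the trailing partial segment instead of folding it into the median:

```python
import numpy as np

def repeating_segment(V, p):
    """Element-wise median over the p-frame segments of the magnitude
    spectrogram V (frequency x time); returns a frequency x p model."""
    r = V.shape[1] // p                         # number of full segments
    segments = V[:, :r * p].reshape(V.shape[0], r, p)
    return np.median(segments, axis=1)          # median across segments
```

Because the median is taken per bin, an outlier in a single segment (e.g. a vocal onset) is discarded entirely, which is exactly the smoothing effect described above.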

Repeating Structure. We take the element-wise min between the repeating segment model and the segments. [Figures: mixture spectrogram and repeating spectrogram.]

Repeating Structure. We obtain a repeating spectrogram model for the repeating musical background. [Figures: mixture spectrogram and repeating spectrogram.]
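
Tiling the segment model to the length of the mixture and taking the element-wise min gives the repeating spectrogram; a sketch:

```python
import numpy as np

def repeating_spectrogram(V, S):
    """Repeating spectrogram W = min(tiled S, V), element-wise, so W
    never exceeds the mixture spectrogram V. S is frequency x p."""
    p = S.shape[1]
    reps = -(-V.shape[1] // p)                  # ceiling division
    S_full = np.tile(S, (1, reps))[:, :V.shape[1]]
    return np.minimum(S_full, V)
```

The min enforces the constraint from the next slide: the background model can never claim more energy in a bin than the mixture actually contains.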

Repeating Structure. The repeating spectrogram model has at most the same values as the mixture spectrogram. [Figures: mixture, repeating, and non-repeating spectrograms.]

Repeating Structure. The repeating spectrogram model is divided by the mixture spectrogram to get a soft mask. [Figures: mixture spectrogram, repeating spectrogram, and time-frequency mask.]

Repeating Structure. In the mask, the more (less) a bin is repeating, the more (less) it is weighted toward 1 (0). [Figures: mixture spectrogram, repeating spectrogram model, and time-frequency mask.]
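
The soft mask is then a single element-wise division; the `eps` guard against silent (all-zero) bins is an addition of this sketch:

```python
import numpy as np

def soft_mask(W, V, eps=1e-12):
    """Soft time-frequency mask M = W / V. Since W <= V element-wise,
    M lies in [0, 1]: close to 1 where a bin is mostly repeating,
    close to 0 where it is not."""
    return W / np.maximum(V, eps)
```

The binary variant described on the next slide is then just `M >= threshold` for a chosen threshold between 0 and 1.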

Repeating Structure. A binary time-frequency mask can be further derived by fixing a threshold between 0 and 1. [Figures: mixture spectrogram, repeating spectrogram model, and time-frequency mask.]

Repeating Structure. The mask is then multiplied with the mixture STFT to extract the repeating background. You actually apply the mask on the STFT! [Figures: mixture spectrogram, time-frequency mask, background spectrogram, and background signal obtained via the inverse STFT.]

Repeating Structure. The non-repeating foreground is equal to the mixture minus the repeating background. [Figures: mixture, background, and foreground signals.]
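
The last two steps (mask times mixture STFT, inverse STFT, foreground by subtraction) can be sketched with SciPy. The window length and hop are arbitrary choices here, and the mask `M` must have the same shape as the STFT:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(x, M, nperseg=256, noverlap=128):
    """Multiply the mask with the mixture STFT (not with the magnitude
    spectrogram!), invert to get the repeating background, and obtain
    the non-repeating foreground as mixture minus background."""
    _, _, Z = stft(x, nperseg=nperseg, noverlap=noverlap)
    _, background = istft(M * Z, nperseg=nperseg, noverlap=noverlap)
    background = background[:len(x)]
    foreground = x - background
    return background, foreground
```

With an all-ones mask the background reconstructs the mixture (the default Hann window at 50% overlap satisfies the COLA condition), so the foreground is zero, which is a quick sanity check.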

Repeating Structure. Repeating background = music. Non-repeating foreground = voice. REPET: 1. Repeating period. 2. Repeating segment. 3. Repeating structure. [Figures: mixture, background, and foreground signals.]

State-of-the-Art. Music/voice separation systems generally first identify the vocal/non-vocal segments, and then use different techniques to separate the musical accompaniment and the lead vocals: non-negative matrix factorization (NMF), accompaniment modeling, pitch-based inference.

State-of-the-Art. Non-negative matrix factorization (NMF): iterative factorization of the mixture spectrogram into non-negative additive basis components. Limitations: need to know the number of components! Need a proper initialization!

State-of-the-Art. Accompaniment modeling: modeling of the musical accompaniment from the non-vocal segments in the mixture. Limitations: need an accurate vocal/non-vocal segmentation! Need a sufficient amount of non-vocal segments!

State-of-the-Art. Pitch-based inference: separation of the vocals using the predominant pitch contour extracted from the vocal segments. Limitations: cannot extract unvoiced vocals! The harmonic structure of instruments can interfere!

Evaluation. REPET [Rafii & Pardo, 2011]: automatic (simple) period finder; geometric mean (instead of median); binary time-frequency masking (not soft). Competitive method [Hsu et al., 2010]: pitch-based inference technique; unvoiced vocals separation; voiced vocals enhancement.

Evaluation. Data set (MIR-1K): 1000 song clips (karaoke Chinese pop songs); 4 to 13 seconds each, for a total of 133 minutes; 3 voice-to-music mixing ratios (-5, 0, and 5 dB).

Evaluation. Comparative results: global separation performance for the voice, using the competitive method (Hsu), REPET (Rafii), and the ideal binary mask (Ideal).

Evaluation. Potential enhancements: separation performance for the voice at a voice-to-music mixing ratio of 0 dB, using REPET and successive enhancements.

Evaluation. Conclusions: REPET can compete with recent (more complex) state-of-the-art music/voice separation methods. There is room for improvement: optimal period, optimal tolerance, indices of the vocal frames. Average computation time: 0.26 second for 1 second of mixture (REPET can work in real time!).

Audio examples. REPET vs. Ozerov (accompaniment modeling). The Prodigy - Breathe. [Waveforms: music and voice estimates from Ozerov and from REPET.]

Audio examples. REPET vs. Virtanen (NMF + pitch-based inference). Unknown song. [Waveforms: music and voice estimates from Virtanen and from REPET.]

Audio examples. REPET vs. FitzGerald (multi-median-based). Wham! - Freedom. [Waveforms: music and voice estimates from FitzGerald and from REPET.]

Audio examples. More REPET examples: RJD2 - Ghostwriter; Rebecca Black - Friday. [Waveforms: background and foreground estimates.]

Future. REPET is very effective on short excerpts with a relatively stable repeating background: 10-20 seconds, similar repetitions, fixed period rate. [Figure: underlying repeating structure, with segments at p, 2p, 3p, ...]

Future. REPET is more likely to show limitations with full-track musical pieces: varying repeating background (e.g. verse/chorus), varying period rate (i.e. varying tempo). [Figure: underlying repeating structure whose period changes from p1 to p2 over time.]

Future. REPET for varying repeating structures! [Liutkus, Rafii, Badeau, Pardo & Richard, 2012]. 1. Identify local periods using a beat spectrogram. 2. Model local repeating segments using median filtering. 3. Extract the repeating structure using a time-frequency mask.
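
The median-filtering step of this adaptive variant can be sketched as below. The per-frame period array `periods` and the neighbor count `k` are hypothetical inputs of this illustration; the paper derives the local periods from the beat spectrogram:

```python
import numpy as np

def adaptive_repeating_model(V, periods, k=2):
    """For every frame i, model the repeating background as the
    element-wise median over the frames {i + m * p_i : m = -k..k}
    that fall inside the spectrogram, where p_i = periods[i] is the
    local repeating period in frames."""
    F, n = V.shape
    S = np.empty_like(V)
    for i in range(n):
        p = int(periods[i])
        cols = [i + m * p for m in range(-k, k + 1) if 0 <= i + m * p < n]
        S[:, i] = np.median(V[:, cols], axis=1)
    return S
```

With a constant period this reduces to the fixed-period median of plain REPET; letting `periods` vary frame by frame is what accommodates tempo changes.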

Future. [Figure: Adaptive REPET pipeline. Step 1: a beat spectrogram B, computed from the mixture spectrogram V of the mixture signal x, gives a local period p_i for every frame i. Step 2: median filtering over the frames {..., i - p_i, i, i + p_i, ...} gives a filtered spectrogram S. Step 3: the element-wise min between S and V gives the repeating spectrogram W, from which the time-frequency mask M is derived.]

Conclusions. REpeating Pattern Extraction Technique: 1. Identify the repeating period. 2. Model the repeating segment. 3. Extract the repeating structure. A simple method that can be applied for music/voice separation, can compete with state-of-the-art methods, and still has room for improvement.

Thank you!

References.
M. Piccardi, "Background Subtraction Techniques: a Review," IEEE International Conference on Systems, Man and Cybernetics, The Hague, Netherlands, October 10-13, 2004.
A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian Models for Single-Channel Source Separation and its Application to Voice/Music Separation in Popular Songs," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1564-1578, July 2007.
T. Virtanen, A. Mesaros, and M. Ryynänen, "Combining Pitch-based Inference and Non-negative Spectrogram Factorization in Separating Vocals from Polyphonic Music," ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, Brisbane, Australia, pp. 17-20, September 21, 2008.
C.-L. Hsu and J.-S. R. Jang, "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310-319, February 2010.
D. FitzGerald and M. Gainza, "Single Channel Vocal Separation using Median Filtering and Factorisation Techniques," ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62-73, 2010.
Z. Rafii and B. Pardo, "A Simple Music/Voice Separation Method based on the Extraction of the Underlying Repeating Structure," IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2011.
A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, "Adaptive Filtering for Music/Voice Separation Exploiting the Repeating Musical Structure," IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 25-30, 2012.