REpeating Pattern Extraction Technique (REPET)

REpeating Pattern Extraction Technique (REPET) EECS 32: Machine Perception of Music & Audio Zafar RAFII, Spring 22

Repetition: Repetition is a fundamental element in generating and perceiving structure. [Example: Propellerheads - History Repeating]

Repetition: Repetitions happen in audio in general: music, repetitive noises, auditory grouping, etc.

Repetition: Repetitions happen in art in general: painting, sculpture, architecture, etc.

Repetition: Repetitions happen in nature in general: animals, plants, objects, etc.

Repetition: Musical pieces are generally characterized by an underlying repeating structure over which varying elements are superimposed. [Example: Propellerheads - History Repeating]

Repetition: This means there should be patterns that are more or less repeating in time and frequency. [Figure: Mixture Spectrogram; color scale from low energy to high energy]

Repetition: The (more or less) repeating patterns could be identified using a time-frequency mask. [Figure: Time-Frequency Mask; color scale from -repeating to +repeating]

Repetition: The mask could be applied to the mixture to extract the (more or less) repeating patterns. [Figure: Repeating Spectrogram; color scale from low energy to high energy]

Repetition: REpeating Pattern Extraction Technique!
1. Identify the repeating period
2. Model the repeating segment
3. Extract the repeating structure
Simple music/voice separation method! Repeating structure = musical background; non-repeating structure = vocal foreground.

REPET: [Figure: overview of the three steps. Step 1: Mixture Signal x -> Mixture Spectrogram V -> Beat Spectrum b -> repeating period p. Step 2: V segmented at period p -> Median -> Repeating Segment S. Step 3: min of S and V -> Repeating Spectrogram W -> Time-Frequency Mask M.]
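The three steps in the overview can be written compactly as follows. This is a sketch using the symbols from the figure (V, b, p, S, W, M); the index conventions, the number of segments r, and the use of the squared spectrogram in the autocorrelation are assumptions, not details stated on the slides.

```latex
% V: n-by-m mixture spectrogram (n frequency channels, m time frames)
\begin{align*}
% Step 1: beat spectrum (mean over rows of the row autocorrelations)
b(j) &= \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m-j+1} \sum_{k=1}^{m-j+1} V(i,k)^2 \, V(i,k+j-1)^2,
  & j &= 1, \dots, m \\
% Step 2: repeating segment (element-wise median of the r segments of length p)
S(i,l) &= \operatorname*{median}_{k=1,\dots,r} V\bigl(i,\, l+(k-1)p\bigr),
  & l &= 1, \dots, p \\
% Step 3: repeating spectrogram and soft time-frequency mask
W\bigl(i,\, l+(k-1)p\bigr) &= \min\bigl\{ S(i,l),\; V\bigl(i,\, l+(k-1)p\bigr) \bigr\},
  & k &= 1, \dots, r \\
M(i,j) &= \frac{W(i,j)}{V(i,j)} \in [0,1]
\end{align*}
```

Here j = 1 corresponds to lag 0, and the period p is read off the peaks of b.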

Practical Advantages: Not feature-dependent. Does not rely on complex frameworks. Does not require prior training.

Practical Interests: Instrument/vocalist identification, pitch/melody transcription, karaoke gaming.

Intellectual Interests: Music understanding, music perception. Simply based on repetition!

REPET: Parallel with background subtraction in vision. Compare frames to estimate a background model.

REPET: Parallel with background subtraction in vision. Extract the background from the foreground.

REPET: Parallel with background subtraction in vision. In audio, we also need to identify the repetitions! [Figure: Mixture Signal, separated into Vocal Foreground and Musical Background]

Repeating Period: We compute the autocorrelations of the rows of the spectrogram to reveal periodicities. [Figure: Mixture Spectrogram and Autocorrelation Plots; for individual frequency channels, the spectrum (amplitude over time) and its autocorrelation (correlation vs. lag in sec)]

Repeating Period: We take the mean of the autocorrelations (rows) to obtain the beat spectrum. [Figure: Mixture Spectrogram -> Autocorrelation Plots -> mean -> Beat Spectrum (correlation vs. lag in sec)]
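A minimal NumPy sketch of this step, assuming a non-negative spectrogram `V` with frequency channels as rows and time frames as columns. The function name `beat_spectrum`, the use of the squared spectrogram, the unbiased lag normalization, and the division by the lag-0 value are assumptions of this sketch, not details given on the slides.

```python
import numpy as np

def beat_spectrum(V):
    """Mean over the frequency rows of each row's autocorrelation.

    V: non-negative spectrogram, shape (n_freq, n_frames).
    Returns b of length n_frames (lags in frames), scaled so that b[0] == 1.
    """
    n_freq, n_frames = V.shape
    P = V ** 2  # assumption: autocorrelate the power spectrogram

    # FFT-based linear autocorrelation of every row; zero-padding to
    # 2 * n_frames avoids circular wrap-around for lags 0..n_frames-1.
    F = np.fft.rfft(P, n=2 * n_frames, axis=1)
    acorr = np.fft.irfft(np.abs(F) ** 2, n=2 * n_frames, axis=1)[:, :n_frames]

    # Lag j is a sum over (n_frames - j) products, so divide by that count.
    acorr = acorr / np.arange(n_frames, 0, -1)

    # Average the rows, then normalize by the lag-0 value.
    b = acorr.mean(axis=0)
    return b / b[0]
```

Lags are in frames; multiplying by the hop duration of the STFT converts them to seconds. An all-zero spectrogram would make the final division undefined, so a real implementation would need to guard against it.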

Repeating Period: The beat spectrum reveals the repeating period p of the underlying repeating structure. [Figure: Mixture Signal; Beat Spectrum with the period p marked on the lag axis (sec)]
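One simple way to read the period off the beat spectrum is to take the highest peak over a range of candidate lags. The sketch below does exactly that; the name `find_period`, the lag-0 exclusion, and the default upper bound of one third of the length (so that at least three repetitions fit) are assumptions of this example, and the slides do not specify the actual rule.

```python
import numpy as np

def find_period(b, min_lag=1, max_lag=None):
    """Return the lag (in frames) with the largest beat-spectrum value.

    b: beat spectrum, b[0] being lag 0.
    min_lag: smallest candidate lag (lag 0 is always the global maximum,
             so it must be excluded).
    max_lag: largest candidate lag; defaults to len(b) // 3 (assumption).
    """
    b = np.asarray(b, dtype=float)
    if max_lag is None:
        max_lag = len(b) // 3
    if max_lag < min_lag:
        raise ValueError("beat spectrum too short for the requested lag range")
    candidates = b[min_lag:max_lag + 1]
    # argmax returns the first maximum, i.e. the shortest lag in case of ties.
    return min_lag + int(np.argmax(candidates))
```

A plain argmax can lock onto a multiple of the true period when the peaks have similar heights, which is one reason the period estimate is a natural target for refinement.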

Repeating Segment: The repeating period is then used to segment the mixture spectrogram at period rate. [Figure: Beat Spectrum; Mixture Spectrogram; Segmented Spectrogram (frequency in kHz vs. time in sec)]

Repeating Segment: The repeating segment model is calculated as the element-wise median of the segments. [Figure: Mixture Spectrogram; Segmented Spectrogram -> median -> Repeating Segment (frequency in kHz vs. time in sec)]

Repeating Segment: The median helps to derive a smooth repeating segment model, removing outliers. [Figure: Mixture Spectrogram -> median -> Repeating Segment Model; color scale from - energy to + energy]
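The segmentation and the element-wise median can be sketched in a few lines of NumPy. The function name `repeating_segment` and the handling of an incomplete last segment (padding with NaN and ignoring it in the median) are assumptions of this sketch.

```python
import numpy as np

def repeating_segment(V, p):
    """Element-wise median of the length-p segments of a spectrogram.

    V: spectrogram, shape (n_freq, n_frames), with n_frames >= p.
    p: repeating period in frames.
    Returns the repeating segment model S, shape (n_freq, p).
    """
    n_freq, n_frames = V.shape
    if not 1 <= p <= n_frames:
        raise ValueError("p must satisfy 1 <= p <= n_frames")
    r = -(-n_frames // p)  # number of segments, rounding up

    # Pad the incomplete last segment with NaN so it can be ignored below.
    padded = np.full((n_freq, r * p), np.nan)
    padded[:, :n_frames] = V

    # segments[f, k, l] is frame l of segment k in frequency channel f.
    segments = padded.reshape(n_freq, r, p)
    return np.nanmedian(segments, axis=1)
```

Because the median ignores values that appear in only a minority of the segments, energy that does not repeat (such as a voice) tends to be left out of S.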

Repeating Structure: We take the element-wise min between the repeating segment model and the segments. [Figure: Mixture Spectrogram; Median; min; Repeating Spectrogram]

Repeating Structure: We obtain a repeating spectrogram model for the repeating musical background. [Figure: Mixture Spectrogram; Median; min; Repeating Spectrogram]

Repeating Structure: The repeating spectrogram model has at most the same values as the mixture spectrogram. [Figure: Mixture Spectrogram; Repeating Spectrogram; Non-repeating Spectrogram]
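A short NumPy sketch of the element-wise minimum, assuming a repeating segment model `S` of shape (n_freq, p) and a mixture spectrogram `V` of shape (n_freq, n_frames). The function name `repeating_spectrogram` is hypothetical.

```python
import numpy as np

def repeating_spectrogram(V, S):
    """Element-wise min between the tiled segment model and the mixture.

    V: mixture spectrogram, shape (n_freq, n_frames).
    S: repeating segment model, shape (n_freq, p).
    Returns W with the same shape as V and W <= V everywhere.
    """
    n_frames = V.shape[1]
    p = S.shape[1]
    r = -(-n_frames // p)  # number of segments, rounding up

    # Repeat the model once per segment and cut it to the mixture length.
    tiled = np.tile(S, (1, r))[:, :n_frames]
    return np.minimum(tiled, V)
```

Taking the minimum guarantees that the repeating model never claims more energy in a bin than the mixture actually contains.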

Repeating Structure: The repeating spectrogram model is divided by the mixture spectrogram to get a soft mask. [Figure: Repeating Spectrogram divided by Mixture Spectrogram -> Time-frequency Mask]

Repeating Structure: In the mask, the more (less) a bin is repeating, the more (less) it is weighted toward 1 (0). [Figure: Mixture Spectrogram -> median -> Repeating Spectrogram Model -> division -> Time-frequency Mask; color scale from - to +]

Repeating Structure: A binary time-frequency mask can be further derived by fixing a threshold between 0 and 1. [Figure: Mixture Spectrogram -> median -> Repeating Spectrogram Model -> division -> Time-frequency Mask; color scale from - to +]
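Both masks can be sketched with NumPy as follows. The function names, the small constant `eps` guarding against division by zero, and the default threshold of 0.5 are assumptions of this example.

```python
import numpy as np

def soft_mask(W, V, eps=1e-12):
    """Soft time-frequency mask M = W / V, with values in [0, 1].

    W: repeating spectrogram (0 <= W <= V element-wise).
    V: mixture spectrogram.
    eps: small constant so that bins with V == 0 get a mask value of 0.
    """
    return np.clip(W / (V + eps), 0.0, 1.0)

def binary_mask(M, threshold=0.5):
    """Binary mask: 1 where the soft mask exceeds the threshold, else 0."""
    return (M > threshold).astype(float)
```

Bins where the repeating model explains all of the mixture energy get a value near 1, and bins dominated by non-repeating energy get a value near 0.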

Repeating Structure: The mask is then multiplied with the mixture STFT to extract the repeating background STFT. You actually apply the mask on the STFT!!! [Figure: Mixture Spectrogram .x Time-frequency Mask -> Background Spectrogram -> istft -> Background Signal]

Repeating Structure: The non-repeating foreground is equal to the mixture minus the repeating background. [Figure: Background Spectrogram -> istft -> Background Signal; Mixture Signal - Background Signal = Foreground Signal]
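The sketch below illustrates masking the complex STFT and subtracting in the time domain. To stay self-contained it uses a deliberately simple stand-in transform (rectangular window, no overlap) in place of a real STFT; the helper names `stft_frames`, `istft_frames`, and `separate` are hypothetical, and a practical implementation would use a windowed, overlapping STFT.

```python
import numpy as np

def stft_frames(x, n):
    """Stand-in STFT: rectangular window of length n, hop n (no overlap).

    Returns a complex array of shape (n // 2 + 1, n_frames).
    """
    n_frames = len(x) // n
    frames = x[:n_frames * n].reshape(n_frames, n)
    return np.fft.rfft(frames, axis=1).T

def istft_frames(X, n):
    """Inverse of stft_frames: inverse FFT of each frame, then concatenation."""
    return np.fft.irfft(X.T, n=n, axis=1).reshape(-1)

def separate(x, M, n):
    """Split x into repeating background and non-repeating foreground.

    M: real-valued mask in [0, 1] with the same shape as stft_frames(x, n).
    """
    X = stft_frames(x, n)
    # The mask scales the complex STFT, so the mixture phase is kept.
    background = istft_frames(M * X, n)
    # Foreground = mixture minus background, in the time domain.
    foreground = x[:background.size] - background
    return background, foreground
```

Since the transform is linear, subtracting the background signal from the mixture gives the same result as applying the complementary mask 1 - M to the mixture STFT.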

Repeating Structure: Repeating background = music; non-repeating foreground = voice.
REPET:
1. Repeating period
2. Repeating segment
3. Repeating structure
[Figure: Mixture Signal -> REPET -> Background Signal and Foreground Signal]

State-of-the-Art: Music/voice separation systems generally first identify the vocal/non-vocal segments and then use different techniques to separate the musical accompaniment and the lead vocals: Non-negative Matrix Factorization (NMF), accompaniment modeling, pitch-based inference.

State-of-the-Art: Non-negative Matrix Factorization (NMF). Iterative factorization of the mixture spectrogram into non-negative additive basic components. Limitations: Need to know the number of components! Need a proper initialization!
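To make the two limitations concrete, here is a minimal NMF sketch using the classic multiplicative updates for the Euclidean cost. It is a generic illustration, not the specific NMF variant used by any of the systems discussed; the function name `nmf` and all parameter names are assumptions. Note that the number of components `k` and the random initialization (`seed`) both have to be chosen by the user.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative matrix V (n x m) as basis @ activations.

    k: number of components (must be chosen in advance).
    seed: controls the random initialization, which affects the result.
    Returns basis (n x k) and activations (k x m), both non-negative.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    basis = rng.random((n, k)) + eps
    activations = rng.random((k, m)) + eps

    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        activations *= (basis.T @ V) / (basis.T @ basis @ activations + eps)
        basis *= (V @ activations.T) / (basis @ activations @ activations.T + eps)
    return basis, activations
```

Different values of `k` or `seed` generally lead to different factorizations, which is the dependence the slide warns about.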

State-of-the-Art: Accompaniment modeling. Modeling of the musical accompaniment from the non-vocal segments in the mixture. Limitations: Need an accurate vocal/non-vocal segmentation! Need a sufficient amount of non-vocal segments!

State-of-the-Art: Pitch-based inference. Separation of the vocals using the predominant pitch contour extracted from the vocal segments. Limitations: Cannot extract unvoiced vocals! Harmonic structure of instruments can interfere!

Evaluation: REPET [Rafii & Pardo, 2]: automatic (simple) period finder; geometrical mean (instead of median); binary time-frequency masking (not soft). Competitive method [Hsu et al., 2]: pitch-based inference technique; unvoiced vocals separation; voiced vocals enhancement.
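For comparison with the median, the geometric mean of the segments can be sketched as below. The function name `geometric_mean_segment`, the small constant `eps` inside the logarithm, and the choice to drop an incomplete last segment are assumptions of this example, not details of the evaluated system.

```python
import numpy as np

def geometric_mean_segment(V, p, eps=1e-12):
    """Element-wise geometric mean of the complete length-p segments.

    V: non-negative spectrogram, shape (n_freq, n_frames).
    p: repeating period in frames.
    Returns a segment model of shape (n_freq, p).
    """
    n_freq, n_frames = V.shape
    r = n_frames // p  # complete segments only
    segments = V[:, :r * p].reshape(n_freq, r, p)
    # Geometric mean computed as exp(mean(log(.))); eps avoids log(0).
    return np.exp(np.mean(np.log(segments + eps), axis=1))
```

Unlike the median, the geometric mean is still pulled upward by a single large outlier: four values 1, 1, 1, 16 have a median of 1 but a geometric mean of 2.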

Evaluation: Data set (MIR-K), song clips (karaoke Chinese pop songs), 4 to 3 seconds for a total of 33 minutes, 3 voice-to-music mixing ratios (-,, and dB).

Evaluation: Comparative results. Global separation performance for the voice using the competitive method (Hsu), REPET (Rafii), and the ideal binary mask (Ideal).

Evaluation: Potential enhancements. Separation performance for the voice at voice-to-music mixing ratio of dB using REPET and successive enhancements.

Evaluation: Conclusions. REPET can compete with recent (more complex) state-of-the-art music/voice separation methods. There is room for improvement: optimal period, optimal tolerance, indices of the vocal frames. Average computation time: .26 second for second of mixture (REPET can work in real-time!).

Audio examples: REPET vs. Ozerov (accompaniment modeling). The Prodigy - Breathe: music estimate and voice estimate (Ozerov); music estimate and voice estimate (REPET).

Audio examples: REPET vs. Virtanen (NMF + pitch-based). Unknown: music estimate and voice estimate (Virtanen); music estimate and voice estimate (REPET).

Audio examples: REPET vs. FitzGerald (multi-median-based). Wham! - Freedom: music estimate and voice estimate (FitzGerald); music estimate and voice estimate (REPET).

Audio examples: REPET (more examples). RJD2 - Ghostwriter: background estimate and foreground estimate. Rebecca Black - Friday: background estimate and foreground estimate.

Future: REPET is very effective on short excerpts with a relatively stable repeating background (-2 seconds): similar repetitions, fixed period rate. [Figure: Underlying Repeating Structure, frequency vs. time, repeating at p, 2p, 3p, ...]

Future: REPET is more likely to show limitations with full-track musical pieces: varying repeating background (e.g. verse/chorus), varying period rate (i.e. varying tempo). [Figure: Underlying Repeating Structure, frequency vs. time, with a period that changes over time]

Future: REPET for varying repeating structure! [Liutkus, Rafii, Badeau, Pardo, Richard, 22]
1. Identify local periods using a beat spectrogram
2. Build local models using median filtering
3. Extract the repeating structure using a t-f mask
[Figure: Underlying Repeating Structure, frequency vs. time, with a period that changes over time]

Future: [Figure: overview of the adaptive method. Step 1: Mixture Signal x -> Mixture Spectrogram V -> Beat Spectrogram B -> local periods. Step 2: for each frame i, Median of the frames of V at i-p, i, i+p -> Filtered Spectrogram S. Step 3: min of S and V -> Repeating Spectrogram W -> Time-Frequency Mask M.]
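The median filtering of steps 2 and 3 can be sketched as follows, assuming the local period of every frame has already been estimated (the beat spectrogram step is not shown). The function name `adaptive_repeating_model`, the `periods` array, and the `n_neighbors` parameter are assumptions of this sketch.

```python
import numpy as np

def adaptive_repeating_model(V, periods, n_neighbors=1):
    """Median-filter each frame with its neighbors at multiples of the local period.

    V: mixture spectrogram, shape (n_freq, n_frames).
    periods: local period (in frames) for every frame, length n_frames.
    n_neighbors: repetitions used on each side of frame i
                 (1 means frames i-p, i, i+p).
    Returns the filtered spectrogram S and the repeating spectrogram W.
    """
    n_freq, n_frames = V.shape
    S = np.empty((n_freq, n_frames), dtype=float)
    offsets = np.arange(-n_neighbors, n_neighbors + 1)

    for i in range(n_frames):
        idx = i + int(periods[i]) * offsets
        idx = idx[(idx >= 0) & (idx < n_frames)]  # drop frames outside the signal
        S[:, i] = np.median(V[:, idx], axis=1)

    # As in the fixed-period case, the model may not exceed the mixture.
    W = np.minimum(S, V)
    return S, W
```

The time-frequency mask then follows from dividing W by V, exactly as in the fixed-period method.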

Conclusions: REpeating Pattern Extraction Technique
1. Identify the repeating period
2. Model the repeating segment
3. Extract the repeating structure
Simple music/voice separation method. Can be applied for music/voice separation. Can compete with state-of-the-art methods. Still room for improvement.

Thank you!

References
M. Piccardi, Background Subtraction Techniques: a Review, IEEE International Conference on Systems, Man and Cybernetics, The Hague, Netherlands, October -3, 24.
A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, Adaptation of Bayesian Models for Single-Channel Source Separation and its Application to Voice/Music Separation in Popular Songs, IEEE Transactions on Audio, Speech, and Language Processing, vol., no., pp. 64-78, July 27.
T. Virtanen, A. Mesaros, and M. Ryynänen, Combining Pitch-based Inference and Non-Negative Spectrogram Factorization in Separating Vocals from Polyphonic Music, ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, Brisbane, Australia, pp. 7-2, September 2, 28.
C.-L. Hsu and J.S. R. Jang, On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-K Dataset, IEEE Transactions on Audio, Speech, and Language Processing, vol. 8, no. 2, pp. 3-39, February 2.
D. FitzGerald and M. Gainza, Single Channel Vocal Separation using Median Filtering and Factorisation Techniques, ISAST Transactions on Electronic and Signal Processing, vol. 4, no., pp. 62-73, 2.
Z. Rafii and B. Pardo, A Simple Music/Voice Separation Method based on the Extraction of the Underlying Repeating Structure, in IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2.
A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, Adaptive Filtering for Music/Voice Separation exploiting the Repeating Musical Structure, in IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 2-3, 22.