Lecture Music Processing Applications of Music Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de
Singing Voice Detection
Important prerequisite for:
Music segmentation
Music thumbnailing (preview version)
Singing voice transcription
Singing voice separation
Lyrics alignment
Lyrics recognition
Singing Voice Detection
Detect singing voice activity during the course of a recording
Assumptions:
Real-world, polyphonic music recordings are analyzed
The singing voice performs the dominant melody above the accompaniment
[Figure: recording with annotated singing voice activity; time axis in seconds]
Singing Voice Detection
Challenges:
Complex characteristics of the singing voice
Large diversity of accompaniment music
Accompaniment may play the same melody as the singing
Pitch-fluctuating instruments may be similar to singing
[Figure: stable pitch vs. fluctuating pitch]
Singing Voice Detection
Common approach:
Frame-wise extraction of audio features
Classification via machine learning
[Figure: frame-wise classification result over the recording; time axis in seconds]
Audio Feature Extraction
Frame-wise processing: each signal frame x(n) of blocksize K is taken at hopsize Q and weighted by a window function w(n)
Compute for each analysis frame:
Time-domain features
Spectral features
Cepstral features
others
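The frame-wise processing can be sketched as follows; the function name `frame_signal`, the blocksize/hopsize defaults, and the Hann window choice are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def frame_signal(x, K=1024, Q=512):
    """Split x into overlapping frames of blocksize K at hopsize Q,
    weighting each frame with a Hann window w(n) (illustrative choice)."""
    w = np.hanning(K)                         # window function w(n)
    n_frames = 1 + (len(x) - K) // Q          # number of complete frames
    frames = np.stack([x[i * Q : i * Q + K] * w for i in range(n_frames)])
    return frames                             # shape: (n_frames, K)

# example: 8192 samples yield 15 complete frames at K=1024, Q=512
x = np.random.default_rng(0).standard_normal(8192)
frames = frame_signal(x)
```

Each row of `frames` can then be passed on to time-domain, spectral, or cepstral feature extractors.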
Audio Feature Extraction
Time-domain features:
Zero Crossing Rate (ZCR): discriminates high-pitched from low-pitched signals
Linear Prediction Coefficients (LPC): encode the spectral envelope
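The ZCR idea can be sketched in a few lines; the helper name and the sampling rate are assumptions chosen for illustration.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign changes (hypothetical helper)."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

# high-pitched tones cross zero more often than low-pitched ones
t = np.arange(1024) / 8000.0                # 8 kHz sampling rate (assumed)
low = np.sin(2 * np.pi * 100 * t)           # 100 Hz tone
high = np.sin(2 * np.pi * 2000 * t)         # 2 kHz tone
```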
Audio Feature Extraction
Spectral features:
Spectrogram, linear vs. logarithmic frequency spacing
Spectral Flatness (SF), Spectral Centroid (SC), and many others
[Figure: STFT spectrogram (dB, linear frequency axis) vs. Gabor wavelet spectrogram (dB, logarithmic frequency axis); time axis in seconds]
Audio Feature Extraction
Cepstral features, with the singing voice as an example:
Time domain, convolutive model: excitation * filter
Excitation: vibration of the vocal folds
Filter: resonance of the vocal tract
Magnitude spectrum, multiplicative model: excitation · filter
Log-magnitude spectrum, additive model: excitation + filter
Liftering: separation into a smooth spectral envelope and a fine-structured excitation
[Figure: observed spectrum, spectral envelope, and excitation spectrum over frequency (Hz)]
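The liftering step can be sketched as below; the lifter length `n_lifter` and the helper name are assumptions chosen for illustration.

```python
import numpy as np

def spectral_envelope(frame, n_lifter=30):
    """Estimate the smooth spectral envelope via cepstral liftering:
    log-magnitude spectrum -> cepstrum -> keep low quefrencies -> back."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)  # additive domain
    cepstrum = np.fft.irfft(log_mag)                      # real cepstrum
    lifter = np.zeros_like(cepstrum)
    lifter[:n_lifter] = 1.0                               # smooth (filter) part
    lifter[-(n_lifter - 1):] = 1.0                        # mirrored counterpart
    envelope = np.fft.rfft(cepstrum * lifter).real        # log-spectral envelope
    return log_mag, envelope

log_mag, envelope = spectral_envelope(np.random.default_rng(0).standard_normal(1024))
```

The high quefrencies that the lifter removes carry the fine-structured excitation.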
Machine Learning
Application to audio signals:
Speech recognition
Speaker recognition
Singing voice detection
Genre classification
Instrument recognition
Chord recognition
etc.
Machine Learning
Learning principles:
Unsupervised learning: find structures in the data
Supervised learning: a human observer provides the ground truth
Semi-supervised learning: combination of the above principles
Reinforcement learning: feedback of confident classifications back into the training
The Feature Space
Geometric and algebraic interpretation of ML problems
Features contain numerical values; concatenating several features yields the dimensionality M
The data set contains N observations (cardinality N)
Illustrative example: SF and SC of 6 complex tones, computed from the magnitude spectrum s(k), k = 0, …, K−1, with bin frequencies f(k):

SF = \frac{\left( \prod_{k=0}^{K-1} s(k) \right)^{1/K}}{\frac{1}{K} \sum_{k=0}^{K-1} s(k)}, \qquad SC = \frac{\sum_{k=0}^{K-1} f(k)\, s(k)}{\sum_{k=0}^{K-1} s(k)}
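Both features can be computed directly from a magnitude spectrum; this is a minimal sketch of the SF and SC formulas (the small epsilon guards against log(0) and is an implementation detail, not part of the definition):

```python
import numpy as np

def spectral_flatness(s, eps=1e-12):
    """Geometric mean over arithmetic mean of the magnitude spectrum s(k)."""
    s = np.asarray(s, dtype=float) + eps
    return float(np.exp(np.mean(np.log(s))) / np.mean(s))

def spectral_centroid(s, f):
    """Magnitude-weighted average of the bin frequencies f(k)."""
    s = np.asarray(s, dtype=float)
    return float(np.sum(f * s) / np.sum(s))

flat = np.ones(512)          # noise-like spectrum: SF close to 1
peaky = np.zeros(512)
peaky[10] = 1.0              # tonal spectrum: SF close to 0
```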
The Feature Space
Each feature has one value (M = 2); number of observations N = 6

File                    Spectral Centroid   Spectral Flatness
lpnoisetone.wav               258.62              0.59
noisetone.wav                 512.73              0.99
hpnoisetone.wav               550.13              0.92
harmonicnoisetone.wav         146.50              0.27
pianotone.wav                  47.93              0.01
harmonictone.wav               43.95              0.01
The Feature Space
Mapping of features: SC to the y-axis, SF to the x-axis
[Figure: scatter plot of Spectral Flatness (x-axis, 0–1) vs. Spectral Centroid (y-axis, 0–600) with unnormalized axes; one point per example file]
The Feature Space
Target class labels (0 or 1), provided by manual annotation:

Label   Spectral Centroid   Spectral Flatness
0             258.62              0.59
0             512.73              0.99
0             550.13              0.92
1             146.50              0.27
1              47.93              0.01
1              43.95              0.01
Classification methods
k-Nearest Neighbours (kNN)
[Figure: feature space with singing voice and accompaniment training data and an unknown data point]
Classification methods
k-Nearest Neighbours (kNN), common distance measures:

L1 distance (Manhattan): d_1 = \sum_{m=1}^{M} |x_m - y_m|
L2 distance (Euclidean): d_2 = \sqrt{\sum_{m=1}^{M} (x_m - y_m)^2}
L∞ distance (Maximum): d_\infty = \max(|x_1 - y_1|, \ldots, |x_M - y_M|)
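A minimal kNN classifier over the three distances, using the (SF, SC) values of the six example tones as a toy feature space; the function name is an assumption.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3, p=2):
    """Majority vote among the k nearest neighbours of x.
    p selects the distance: 1 (Manhattan), 2 (Euclidean), np.inf (maximum)."""
    dists = np.linalg.norm(X_train - x, ord=p, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(np.bincount(y_train[nearest]).argmax())

# toy feature space: (SF, SC) per tone, labels 0/1 as in the annotation
X = np.array([[0.59, 258.62], [0.99, 512.73], [0.92, 550.13],
              [0.27, 146.50], [0.01, 47.93], [0.01, 43.95]])
y = np.array([0, 0, 0, 1, 1, 1])
```

Note that with unnormalized axes the centroid values dominate every distance, which is one reason feature normalization matters in practice.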
Classification methods
Decision Trees (DT)
[Figure: feature space with singing voice and accompaniment training data and an unknown data point]
Classification methods
Random Forests (RF)
[Figure: feature space with singing voice and accompaniment training data and an unknown data point]
Classification methods
Gaussian Mixture Models (GMM)
[Figure: feature space with singing voice and accompaniment training data, an unknown data point, and a Gaussian component with covariance Σ]
Classification methods
Gaussian Mixture Models (GMM)
[Figure: feature space with singing voice and accompaniment data modeled by several Gaussian components]
Classification methods
Support Vector Machines (SVM)
Decision function: y(x) = sgn(⟨w, x⟩ + b)
[Figure: feature space with singing voice and accompaniment training data, the separating hyperplane, and an unknown data point]
Classification methods
Deep Neural Networks (DNN)
[Figure: feature space with singing voice and accompaniment training data and an unknown data point]
Classification methods
Deep Neural Networks (DNN), trained by minimizing a loss function
[Figure: feature space with singing voice and accompaniment training data and an unknown data point]
Classification methods
Further methods:
Hidden Markov Models: transition probabilities between GMMs
Sparse Representation Classifier: sparse linear combination of training data
Boosting: combination of many weak classifiers
Convolutional Neural Networks
Recurrent Neural Networks
Multiple Kernel Learning
others
Singing Voice Detection
Frame-wise feature extraction: Mel-scale Frequency Cepstral Coefficients x_t, computed per frame via a mel filter bank
Gaussian Mixture Model (GMM) with G components:

p(x_t) = \sum_{g=1}^{G} w_g \, \mathcal{N}(x_t \mid \mu_g, \Sigma_g)

Segment-by-segment classification over windows of W frames, with one GMM \Lambda_S trained on singing and one GMM \Lambda_M trained on accompaniment: decide "Singing" if

\sum_{i=0}^{W-1} \log p(x_{t+i} \mid \Lambda_S) > \sum_{i=0}^{W-1} \log p(x_{t+i} \mid \Lambda_M)

and "Accompaniment" otherwise.
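The segment-wise decision rule can be sketched with diagonal-covariance Gaussians; the helper names and the toy one-component models are assumptions, and a real system would train the GMM parameters (e.g. via EM) on MFCC data.

```python
import numpy as np

def log_gauss(X, mu, var):
    """Frame-wise log-density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=-1)

def gmm_loglik(X, weights, mus, variances):
    """log p(x_t) = log sum_g w_g N(x_t | mu_g, Sigma_g), per frame t."""
    comp = np.stack([np.log(w) + log_gauss(X, m, v)
                     for w, m, v in zip(weights, mus, variances)])
    m = comp.max(axis=0)
    return m + np.log(np.exp(comp - m).sum(axis=0))   # log-sum-exp

def classify_segment(X, gmm_singing, gmm_accomp):
    """Sum the frame log-likelihoods over a segment of W frames and
    decide for the model with the larger total."""
    ll_s = gmm_loglik(X, *gmm_singing).sum()
    ll_m = gmm_loglik(X, *gmm_accomp).sum()
    return "Singing" if ll_s > ll_m else "Accompaniment"

# toy models: one component each, given as (weights, means, variances)
gmm_s = ([1.0], [np.zeros(2)], [np.ones(2)])
gmm_m = ([1.0], [np.full(2, 5.0)], [np.ones(2)])
```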
Audio Mosaicing
Target signal: Beatles, "Let It Be"
Source signal: bees
Mosaic signal: "Let It Bee"
NMF-Inspired Audio Mosaicing
Non-negative matrix factorization (NMF) [Driedger et al., ISMIR 2015]: a non-negative matrix V is factorized into components W and activations H, V ≈ W · H (V fixed, W and H learned)
Proposed audio mosaicing approach: the target's spectrogram (fixed, frequency × target time) is approximated by the product of the source's spectrogram (fixed, frequency × source time) and an activation matrix (source time × target time, learned), which yields the mosaic's spectrogram
Basic NMF-Inspired Audio Mosaicing
Mosaic spectrogram (frequency × target time) = source spectrogram (frequency × source time) · activation matrix (source time × target time), fitted to the target spectrogram
Basic NMF-Inspired Audio Mosaicing
The activation matrix is learned by iterative updates
Core idea: preserve the temporal context of the source by supporting the development of sparse, diagonal activation structures
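The basic fitting step, without the temporal-context constraints of the full method, can be sketched as plain KL-divergence NMF with fixed templates; the function name, iteration count, and toy data are assumptions.

```python
import numpy as np

def mosaic_activations(V_target, W_source, n_iter=300, eps=1e-12):
    """Learn activations H so that W_source @ H approximates V_target.
    W_source stays fixed (its columns are source spectrogram frames);
    multiplicative updates minimize the KL-divergence NMF objective."""
    rng = np.random.default_rng(0)
    H = rng.random((W_source.shape[1], V_target.shape[1]))
    ones = np.ones_like(V_target)
    for _ in range(n_iter):
        WH = W_source @ H + eps
        H *= (W_source.T @ (V_target / WH)) / (W_source.T @ ones + eps)
    return H

# toy check: a target built from source frames should be reconstructable
W = np.random.default_rng(1).random((64, 10)) + 0.1   # source "spectrogram"
V = W[:, [3, 1, 4]]                                    # target = frames 3, 1, 4
H = mosaic_activations(V, W)
mosaic = W @ H
```

The full approach additionally steers these updates so that H develops sparse, diagonal structures, which is what preserves the temporal context of the source material.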
Audio Mosaicing
Target signal: Chic, "Good Times"
Source signal: whales
Mosaic signal
https://www.audiolabs-erlangen.de/resources/mir/2015-ismir-letitbee
Audio Mosaicing
Target signal: Adele, "Rolling in the Deep"
Source signal: race car
Mosaic signal
https://www.audiolabs-erlangen.de/resources/mir/2015-ismir-letitbee
Drum Source Separation
Drum Source Separation
Signal model: the STFT spectrogram V is modeled as a superposition of component spectrograms, which are transformed back to the time domain via the inverse STFT
[Figure: log-frequency spectrograms and relative amplitudes of the components; time axis 2–3.8 seconds]
Drum Sound Separation
Decomposition via NMF Deconvolution (NMFD)
Score-based information (drum notation) initializes the rows of the activation matrix H
Audio-based information (training drum sounds) initializes the lateral slices of the template tensor W
[Figure: log-frequency templates and activations over time (seconds)]
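NMFD models each drum sound as a short two-dimensional template that is convolved with its activations over time. This sketch shows only the reconstruction step (not the learning of W and H), with hypothetical names and shapes.

```python
import numpy as np

def nmfd_reconstruct(W, H):
    """Approximate spectrogram of the convolutive model:
    V[f, t] ≈ sum_r sum_tau W[f, r, tau] * H[r, t - tau].
    W: templates (n_bins, n_drums, template_len), H: activations (n_drums, n_frames)."""
    n_bins, n_drums, T = W.shape
    n_frames = H.shape[1]
    V = np.zeros((n_bins, n_frames))
    for tau in range(T):
        H_shift = np.zeros_like(H)
        H_shift[:, tau:] = H[:, :n_frames - tau]      # delay activations by tau
        V += W[:, :, tau] @ H_shift
    return V

# an impulse activation at frame 2 stamps the whole template into the output
W = np.random.default_rng(0).random((8, 1, 4))
H = np.zeros((1, 10)); H[0, 2] = 1.0
V = nmfd_reconstruct(W, H)
```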
Drum Sound Separation
[Figure: log-frequency spectrogram and relative amplitudes of the separated drum sounds; time axis 0–6.5 seconds]
https://www.audiolabs-erlangen.de/resources/mir/2016-ieee-taslp-drumseparation