Singing Voice Detection
Lecture Music Processing: Applications of Music Processing
Christian Dittmar
International Audio Laboratories Erlangen
christian.dittmar@audiolabs-erlangen.de

Singing voice detection is an important pre-requisite for:
- Music segmentation
- Music thumbnailing (preview version)
- Singing voice transcription
- Singing voice separation
- Lyrics alignment
- Lyrics recognition

Singing Voice Detection
Goal: detect singing voice activity during the course of a recording.
Assumptions:
- Real-world, polyphonic music recordings are analyzed
- The singing voice performs the dominant melody above the accompaniment
Challenges:
- Complex acoustic characteristics of the singing voice
- Large diversity of accompaniment; instruments may play the same melody as the singing voice
- Pitch-fluctuating instruments may be similar to singing
[Figure: spectrogram over time in seconds, contrasting a stable-pitch passage with a fluctuating-pitch passage]

Audio Feature Extraction
Common approach:
- Frame-wise extraction of audio features
- Classification via machine learning
Frame-wise processing:
- Hopsize Q
- Blocksize K
- Window function w(n)
- Signal frame x(n)
[Figure: windowed analysis frames along the waveform, time axis in seconds]
Compute for each analysis frame:
- Time-domain features
- Spectral features
- Cepstral features
- others
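The frame-wise processing step above can be sketched in a few lines. This is a minimal sketch, assuming a Hann window as w(n); the function name frame_signal and the parameter defaults are hypothetical, since the slides do not fix an implementation:

```python
import math

def frame_signal(x, K=1024, Q=512):
    """Split signal x into overlapping frames of blocksize K and hopsize Q,
    applying a Hann window w(n) to each frame (a common default choice)."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / K) for n in range(K)]  # Hann window
    frames = []
    for start in range(0, len(x) - K + 1, Q):
        frames.append([x[start + n] * w[n] for n in range(K)])
    return frames

# Example: 1 second of a 440 Hz sine at 8 kHz sampling rate
sr = 8000
x = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
frames = frame_signal(x, K=1024, Q=512)
print(len(frames))  # number of analysis frames
```

Each frame would then be fed to the feature extractors below; with K = 1024 and Q = 512, consecutive frames overlap by 50 %.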
Audio Feature Extraction
Time-domain features:
- Zero Crossing Rate (ZCR): discriminates high-pitched vs. low-pitched (and noisy vs. tonal) frames
Spectral features:
- Magnitude spectrum with linear vs. logarithmic frequency spacing
- Spectral Flatness (SF), Spectral Centroid (SC), and many others
- Linear Prediction Coefficients (LPC): encode the spectral envelope
[Figure: STFT magnitude in dB vs. Gabor wavelet magnitude in dB; frequency axis in Hz, time axis in seconds]

Audio Feature Extraction
Cepstral features, with the singing voice as an example:
- Source-filter model, convolutive: excitation * filter
  - Excitation: vibration of the vocal folds
  - Filter: resonance of the vocal tract
- Magnitude spectrum, multiplicative: excitation · filter
- Log-magnitude spectrum, additive: excitation + filter
- Liftering: separation into a smooth spectral envelope and a fine-structured excitation
- Extraction of the spectral envelope via cepstral liftering
[Figure: magnitude spectrum, logarithmic magnitude, observed spectrum, envelope, and excitation spectrum; frequency axis in Hz]
Application to audio signals:
- Speech recognition
- Speaker recognition
- Singing voice detection
- Genre classification
- Instrument recognition
- Chord recognition
- etc.

Machine Learning
Learning principles:
- Unsupervised learning: find structures in data
- Supervised learning: a human observer provides ground truth
- Semi-supervised learning: combination of the above principles
- Reinforcement learning: feedback of confident classifications into the training

The Feature Space
- Geometric and algebraic interpretation of ML problems
- Features contain numerical values
- Concatenation of several features: dimensionality M
- The data set contains N observations: cardinality N
Illustrative example: SF and SC of 6 complex tones, with s(k) the magnitude spectrum in bin k and f(k) the bin's center frequency:

  SF = (\prod_{k=1}^{K} s(k))^{1/K} / ((1/K) \sum_{k=1}^{K} s(k))

  SC = \sum_{k=1}^{K} f(k) s(k) / \sum_{k=1}^{K} s(k)
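The SF and SC definitions translate directly into code. A minimal sketch (the function names spectral_flatness and spectral_centroid are hypothetical; the geometric mean is computed in the log domain to avoid numerical underflow for long spectra):

```python
import math

def spectral_flatness(s):
    """SF: geometric mean over arithmetic mean of the magnitude spectrum s(k)."""
    K = len(s)
    geo = math.exp(sum(math.log(v) for v in s) / K)  # (prod s(k))^(1/K), log domain
    arith = sum(s) / K
    return geo / arith

def spectral_centroid(s, f):
    """SC: magnitude-weighted mean of the bin center frequencies f(k)."""
    return sum(fk * sk for fk, sk in zip(f, s)) / sum(s)

# Flat (noise-like) spectrum vs. peaky (tonal) spectrum over the same bins
f = [100.0 * k for k in range(1, 9)]
flat = [1.0] * 8
peaky = [0.01] * 8
peaky[2] = 1.0
print(spectral_flatness(flat))   # -> 1.0 (maximally flat, noise-like)
print(spectral_flatness(peaky))  # close to 0 (tonal)
print(spectral_centroid(peaky, f))
```

As the example shows, SF near 1 indicates a noise-like spectrum and SF near 0 a tonal one, which is exactly what separates the noise tones from the harmonic tones in the scatter plot that follows.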
The Feature Space
- Each feature has one value: M = 2
- Number of observations: N = 6
- Mapping of features: SC to the y-axis, SF to the x-axis
- Scatter plot with unnormalized axes
- Target class labels provided by manual annotation

  File                    Centroid   Flatness
  lpnoisetone.wav         258.62     .59
  noisetone.wav           52.73      .99
  hpnoisetone.wav         55.3       .92
  harmonicnoisetone.wav   46.5       .27
  pianotone.wav           47.93      .
  harmonictone.wav        43.95      .

[Figure: scatter plot of Spectral Flatness (x-axis) vs. Spectral Centroid (y-axis) for the six example tones]

Classification Approaches
- k-nearest Neighbours (kNN)
- Decision Trees (DT)
Distance measures between feature vectors x and y of dimension M:
- L1 distance (Manhattan):  d_1(x, y) = \sum_{m=1}^{M} |x(m) - y(m)|
- L2 distance (Euclidean):  d_2(x, y) = (\sum_{m=1}^{M} (x(m) - y(m))^2)^{1/2}
- L∞ distance (Maximum):    d_\infty(x, y) = \max(|x(1) - y(1)|, \ldots, |x(M) - y(M)|)
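A kNN classifier over such a feature space, with the distance measures above selectable, can be sketched as follows. The toy data and all names are hypothetical, loosely modeled on the flatness/centroid example with values normalized to [0, 1]:

```python
def dist(x, y, p="L2"):
    """L1 (Manhattan), L2 (Euclidean) and L-infinity (maximum) distances."""
    diffs = [abs(xm - ym) for xm, ym in zip(x, y)]
    if p == "L1":
        return sum(diffs)
    if p == "L2":
        return sum(d * d for d in diffs) ** 0.5
    return max(diffs)  # L-infinity

def knn_classify(query, data, labels, k=3, p="L2"):
    """Majority vote among the k nearest training observations."""
    ranked = sorted(range(len(data)), key=lambda i: dist(query, data[i], p))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy feature space: (Flatness, Centroid) pairs, labels 'noise' vs. 'tone'
data = [(0.99, 0.55), (0.92, 0.75), (0.59, 0.25),
        (0.01, 0.04), (0.00, 0.04), (0.27, 0.46)]
labels = ["noise", "noise", "noise", "tone", "tone", "tone"]
print(knn_classify((0.05, 0.1), data, labels, k=3))  # -> 'tone'
```

Note that kNN is sensitive to the scaling of the axes, which is why the unnormalized scatter plot above would be normalized before classification in practice.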
Classification Approaches
- Random Forests (RF): ensembles of decision trees
- Gaussian Mixture Models (GMM): class densities modeled as a weighted sum of G Gauss components with means \mu_g and covariances \Sigma_g
- Support Vector Machines (SVM): decision function of the form f(x) = sgn(\langle w, x \rangle + b)
- Deep Neural Networks (DNN): stacked layers of weighted sums and non-linearities, trained by minimizing a loss function
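The GMM density is just a weighted sum of Gaussian densities. A minimal univariate sketch (a real detector would use multivariate Gaussians over the feature vectors; function and variable names are hypothetical):

```python
import math

def gauss(x, mu, var):
    """Univariate normal density N(x; mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_g w_g * N(x; mu_g, Sigma_g); the weights w_g must sum to 1."""
    return sum(w * gauss(x, m, v) for w, m, v in zip(weights, means, variances))

# Two-component mixture as a toy class model
weights = [0.4, 0.6]
means = [0.0, 3.0]
variances = [1.0, 0.5]
print(gmm_pdf(0.0, weights, means, variances))
print(gmm_pdf(3.0, weights, means, variances))
```

In a classifier, one such mixture is trained per class, and a frame is assigned to the class whose mixture gives it the higher likelihood.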
Classification Approaches
Further methods:
- Hidden Markov Models: transition probabilities between GMMs
- Sparse Representation Classifier: sparse linear combination of training data
- Boosting: combine many weak classifiers
- Convolutional Neural Networks
- Recurrent Neural Networks
- Multiple Kernel Learning
- others

Singing Voice Detection
- Mel-scale cepstral coefficients: each frame is passed through a Mel filter bank, yielding one feature vector x_t per frame
- Segment-by-segment classification into V (vocal) and N (non-vocal) frames
- A segment W_t of W frames around frame t is assigned to the class with the higher average log-likelihood:

  (1/W) \sum_{i \in W_t} \log p(x_i | Singing)  vs.  (1/W) \sum_{i \in W_t} \log p(x_i | NoSinging)

- Gaussian Mixture Model (GMM):

  p(x) = w_1 N(x; \mu_1, \Sigma_1) + w_2 N(x; \mu_2, \Sigma_2) + \ldots + w_G N(x; \mu_G, \Sigma_G)

  with G Gauss components and mixture weights w_1, w_2, \ldots, w_G summing to one

Audio Mosaicing
NMF-Inspired Audio Mosaicing [Driedger et al., ISMIR 2015]
- Target signal: Beatles, "Let it be"
- Source signal: Bees
- Non-negative matrix factorization (NMF): non-negative matrix (fixed) ≈ components (learned) · activations (learned)
- Proposed audio mosaicing approach: the target's spectrogram (fixed) is approximated by the source's spectrogram (fixed, used as components) · activations (learned); the product is the mosaic's spectrogram
- Mosaic signal: "Let it Bee"

Basic NMF-Inspired Audio Mosaicing
- Iterative updates
- Preserve temporal context
- Core idea: support the development of sparse diagonal activation structures
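The mosaicing idea of learning activations against a fixed dictionary can be sketched with the standard multiplicative NMF update for the Euclidean cost. This is a plain-NMF sketch with hypothetical names; Driedger et al. additionally modify the updates to favor the sparse diagonal activation structures mentioned above, which is not shown here:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf_activations(V, W, iters=500, eps=1e-9):
    """Learn activations H for fixed templates W so that V ~ W @ H,
    using the multiplicative update H <- H * (W^T V) / (W^T W H)."""
    rows_h, cols_h = len(W[0]), len(V[0])
    H = [[1.0] * cols_h for _ in range(rows_h)]  # non-negative initialization
    Wt = transpose(W)
    for _ in range(iters):
        num = matmul(Wt, V)
        den = matmul(Wt, matmul(W, H))
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols_h)]
             for i in range(rows_h)]
    return H

# Toy example: V built from known activations should be closely re-approximated
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # fixed 'source' templates (columns)
H_true = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]   # ground-truth activations
V = matmul(W, H_true)                         # fixed 'target'
H = nmf_activations(V, W)
print(H)
```

In the mosaicing setting, W would hold the source's spectrogram frames and V the target's spectrogram; the mosaic's spectrogram is then the product W · H.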
Audio Mosaicing
Further examples:
- Target signal: Chic, "Good times"; source signal: Whales; mosaic signal
- Target signal: Adele, "Rolling in the Deep"; source signal: Race car; mosaic signal
https://www.audiolabs-erlangen.de/res/ir/25-isir-letitbee

Drum Source Separation
Signal model:
- The mixture's STFT magnitude V is decomposed into component magnitudes, V ≈ V_1 + V_2 + ... + V_C, one per drum instrument
- Component signals are reconstructed from their spectrograms via the inverse STFT (iSTFT)
[Figure: log-frequency spectrograms of the mixture and the separated components; relative amplitude over time in seconds]
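The slides leave open how the component signals are obtained from the component magnitudes before the iSTFT; a common choice is soft (Wiener-like) masking of the mixture STFT. A sketch of the mask computation under that assumption (all names hypothetical):

```python
def soft_masks(components, eps=1e-9):
    """Given per-component magnitude spectrograms V_c with V ~ sum_c V_c,
    build soft masks M_c = V_c / (sum_c V_c) for each time-frequency bin.
    Each mask would multiply the mixture STFT before the iSTFT."""
    rows = len(components[0])
    cols = len(components[0][0])
    total = [[sum(c[i][j] for c in components) + eps for j in range(cols)]
             for i in range(rows)]
    return [[[c[i][j] / total[i][j] for j in range(cols)] for i in range(rows)]
            for c in components]

# Toy 2x2 magnitude spectrograms for a 'drums' and a 'harmonic' component
drums = [[4.0, 0.0], [1.0, 1.0]]
harmonic = [[0.0, 2.0], [1.0, 3.0]]
m_drums, m_harm = soft_masks([drums, harmonic])
print(m_drums)  # masks of all components sum to 1 in every bin
```

Soft masking guarantees that the per-bin energy of the mixture is distributed among the components rather than duplicated, which keeps the reconstructed signals consistent with the mixture.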
Drum Sound Separation
- Decomposition via NMFD (non-negative matrix factor deconvolution)
- Score-based information (drum notation) informs the activations, i.e., the rows of H
- Audio-based information (training drum sounds) informs the templates, i.e., the lateral slices of W
[Figure: log-frequency lateral slices of W and corresponding activation rows of H; relative amplitude over time in seconds]
https://www.audiolabs-erlangen.de/res/ir/26-ieee-taslp-drumseparation
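In the NMFD model, each template is not a single spectrum but a short time-frequency patch (a lateral slice of W) that is shifted along time by its activation row. The sketch below shows the forward model only; learning W and H uses multiplicative updates analogous to NMF, and all names here are hypothetical:

```python
def nmfd_approx(W, H):
    """Convolutive NMFD model: V(f, t+tau) ~ sum_c sum_tau W[c][f][tau] * H[c][t].
    W[c] is a 2-D time-frequency template (lateral slice), H[c] an activation row."""
    C = len(W)            # number of drum templates
    F = len(W[0])         # frequency bins
    T_w = len(W[0][0])    # template length in frames
    T = len(H[0])         # total frames
    V = [[0.0] * T for _ in range(F)]
    for c in range(C):
        for t in range(T):
            if H[c][t] == 0.0:
                continue  # template only contributes where it is activated
            for tau in range(min(T_w, T - t)):
                for f in range(F):
                    V[f][t + tau] += W[c][f][tau] * H[c][t]
    return V

# One drum template (2 bins x 2 frames) triggered at frames 0 and 3
W = [[[1.0, 0.5],
      [0.0, 1.0]]]
H = [[1.0, 0.0, 0.0, 2.0, 0.0]]
V = nmfd_approx(W, H)
print(V)  # [[1.0, 0.5, 0.0, 2.0, 1.0], [0.0, 1.0, 0.0, 0.0, 2.0]]
```

Because each onset stamps a whole two-frame patch into V, the model captures the characteristic attack-decay evolution of a drum hit, which a single-spectrum NMF template cannot.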